Meinshausen and Buhlmann [Ann. Statist. 34 (2006) 1436-1462]
showed that, for neighborhood selection in Gaussian graphical mod- 
els, under a neighborhood stability condition, the LASSO is con- 
sistent, even when the number of variables is of greater order than 
the sample size. Zhao and Yu [(2006) J. Machine Learning Research 
7 2541-2567] formalized the neighborhood stability condition in the 
context of linear regression as a strong irrepresentable condition. That 
paper showed that under this condition, the LASSO selects exactly 
the set of nonzero regression coefficients, provided that these coeffi- 
cients are bounded away from zero at a certain rate. In this paper, 
the regression coefficients outside an ideal model are assumed to be 
small, but not necessarily zero. Under a sparse Riesz condition on 
the correlation of design variables, we prove that the LASSO selects 
a model of the correct order of dimensionality, controls the bias of 
the selected model at a level determined by the contributions of small 
regression coefficients and threshold bias, and selects all coefficients 
of greater order than the bias of the selected model. Moreover, as a 
consequence of this rate consistency of the LASSO in model selection, 
it is proved that the sum of error squares for the mean response and 
the $\ell_\alpha$-loss for the regression coefficients converge at the best possible
rates under the given conditions. An interesting aspect of our results 
is that the logarithm of the number of variables can be of the same 
order as the sample size for certain random dependent designs. 

1. Introduction. Consider a linear regression model 

(1.1)    $y_i = \sum_{j=1}^{p} x_{ij}\beta_j + \varepsilon_i, \qquad i = 1, \ldots, n,$

where $y_i$ is the response variable, $x_{ij}$ are covariates or design variables and $\varepsilon_i$
is the error term. In many applications, such as studies involving microarray 
or mass spectrum data, the total number of covariates p can be large or 
even much larger than n, but the number of important covariates is typically 
smaller than n. With such data, regularized or penalized methods are needed 
to fit the model and variable selection is often the most important aspect of 
the analysis. The LASSO [Tibshirani (1996)] is a penalized method similar 
to the ridge regression but uses the $L_1$-penalty $\sum_{j=1}^{p}|\beta_j|$ instead of the
$L_2$-penalty $\sum_{j=1}^{p}\beta_j^2$. An important feature of the LASSO is that it can
be used for variable selection. Compared to the classical variable selection 
methods, such as subset selection, the LASSO has two advantages. First, 
the selection process in the LASSO is based on continuous trajectories of 
regression coefficients as functions of the penalty level and is hence more 
stable than subset selection methods. Second, the LASSO is computationally 
feasible for high-dimensional data [Osborne, Presnell and Turlach (2000a, 
2000b), Efron et al. (2004)]. In contrast, computation in subset selection is 
combinatorial and not feasible when p is large. 

Several authors have studied the model-selection consistency of the LASSO 
in the sense of selecting exactly the set of variables with nonzero coefficients,
that is, identifying the subset $\{j : \beta_j \neq 0\}$ of $\{1, \ldots, p\}$. In the low-dimensional
setting with fixed $p$, Knight and Fu (2000) showed that, under
appropriate conditions, the LASSO is consistent for estimating the regression
parameters $\beta_j$ and their limiting distributions can have positive probability
mass at 0 when $\beta_j = 0$. However, careful inspection of their results indicates
that the positive probability mass at 0 is less than 1 in the limit for certain
configurations of the covariates and regression coefficients, which suggests 
that the LASSO is not variable-selection consistent without proper assump- 
tions. Leng, Lin and Wahba (2006) showed that the LASSO is, in general, 
not variable-selection consistent when the prediction accuracy is used as 
the criterion for choosing the penalty parameter. On the other hand, Mein- 
shausen and Buhlmann (2006) showed that, for neighborhood selection in 
the Gaussian graphical models, under a neighborhood stability condition on 
the design matrix and certain additional regularity conditions, the LASSO 
is consistent, even when the number of variables tends to infinity at a rate 
faster than n. Zhao and Yu (2006) formalized the neighborhood stability con- 
dition in the context of linear regression models as a strong irrepresentable 
condition. They showed that under this crucial condition and certain other 
regularity conditions, the LASSO is consistent for variable selection, even 
when the number of variables $p$ is as large as $\exp(n^{a})$ for some $0 < a < 1$.
Thus, their results are applicable to high-dimensional regression problems, 
provided that the conditions, in particular, the strong irrepresentable condition,
are reasonable for the data.

In this paper, we provide a different set of sufficient conditions under
which the LASSO is rate consistent in the sparsity and bias of the selected 
model in high-dimensional regression. The usual definition of sparseness for 
model selection, as used in Meinshausen and Buhlmann (2006) and Zhao 
and Yu (2006), is that only a small number of regression coefficients are 
nonzero and all nonzero coefficients are uniformly bounded away from zero 
at a certain rate. Thus, variable selection is equivalent to distinguishing 
between nonzero and zero coefficients with a separation zone. We consider a 
more general concept of sparseness: a model is sparse if most coefficients are 
small, in the sense that the sum of their absolute values is below a certain 
level. Under this general sparsity assumption, it is no longer sensible to 
select exactly the set of nonzero coefficients. Therefore, in cases where the 
exact selection consistency for all $\beta_j \neq 0$ is unattainable or undesirable, we
propose to evaluate the selected model with the sparsity as its dimension and 
the bias as the unexplained part of the mean vector and the missing large 
coefficients. As our goal is to select a parsimonious model which approximates
the truth well, the sparsity and bias are suitable measures of performance. 
This is not to be confused with criteria for estimation or prediction, since we 
are not bound to use the LASSO for these purposes after model selection. 

Under a sparse Riesz condition which limits the range of the eigenvalues 
of the covariance matrices of all subsets of a fixed number of covariates, we 
prove that the LASSO selects a model with the correct order of sparsity and 
controls the bias of the selected model at a level of the same order as the 
bias of the LASSO in the well-understood case of orthonormal design. Con- 
sequently, the LASSO selects all variables with coefficients above a threshold 
determined by the controlled bias of the selected model. In this sense, and in 
view of the optimality properties of the soft threshold method for orthonor- 
mal designs [Donoho and Johnstone (1994)], our results provide the rate 
consistency of the LASSO for general designs under the sparse Riesz con- 
dition. As mentioned in the previous paragraph, the LASSO does not have 
to be used for estimation and prediction after model selection. Nevertheless, 
we show that the rate consistency of the LASSO selection implies the convergence
of the LASSO estimator to the true mean $Ey_i$ and coefficients $\beta_j$
at the same rate as in the case of orthonormal design. 

When the number of regression coefficients exceeds the number of observations
($p > n$), there are potentially many models fitting the same data.
However, there is a certain uniqueness among such models under sparsity 
constraints. Under the sparse Riesz condition, all sets of $q^*$ design vectors
are linearly independent for a certain given rank $q^*$ so that the linear combination
of design vectors is unique among all coefficient vectors of sparsity
$q^*/2$ or less. Moreover, our rate consistency result proves that under mild
conditions, the representation of all coefficients above a certain threshold
level is determined in the selected model with high probability. Of course,
such uniqueness is invalid when the sparsity assumption fails to hold. 

We describe our rate consistency results in Section 2 and prove them 
in Section 5. Implications of the rate consistency for the convergence rate 
of the LASSO estimator are discussed in Section 3. The sparse Riesz and 
strong irrepresentable conditions do not imply each other in general, but 
the sparse Riesz condition is easier to interpret and less restrictive from a 
practical point of view. In Section 4, we provide sufficient conditions for the 
sparse Riesz condition for deterministic and random covariates. In Section 6, 
we discuss some closely related work in detail and make a few final remarks. 

2. Rate consistency of the LASSO in sparsity and bias. The linear model
(1.1) can be written as

(2.1)    $y = \sum_{j=1}^{p} \beta_j x_j + \varepsilon = X\beta + \varepsilon,$

where $y \equiv (y_1, \ldots, y_n)'$, $x_j$ are the columns of the design matrix $X \equiv (x_{ij})_{n \times p}$,
$\beta \equiv (\beta_1, \ldots, \beta_p)'$ is the vector of regression coefficients and $\varepsilon \equiv (\varepsilon_1, \ldots, \varepsilon_n)'$.
Unless otherwise explicitly stated, we treat $X$ as a given deterministic matrix.

For a given penalty level $\lambda \ge 0$, the LASSO estimator of $\beta \in \mathbb{R}^p$ is

(2.2)    $\hat\beta \equiv \hat\beta(\lambda) \equiv \arg\min_{\beta}\{\|y - X\beta\|^2/2 + \lambda\|\beta\|_1\},$

where $\|\cdot\|$ is the Euclidean distance and $\|\beta\|_1 \equiv \sum_j |\beta_j|$ is the $\ell_1$-norm. In
this paper,

(2.3)    $\hat A \equiv \hat A(\lambda) \equiv \{j \le p : \hat\beta_j \neq 0\}$

is considered as the model selected by the LASSO.

As mentioned in the Introduction, we consider model selection properties
of the LASSO under a sparsity condition on the regression coefficients and a
sparse Riesz condition on the covariates. The sparsity condition asserts the
existence of an index set $A_0 \subset \{1, \ldots, p\}$ such that

(2.4)    $\#\{j \le p : j \notin A_0\} = q, \qquad \sum_{j \in A_0} |\beta_j| \le \eta_1.$

Under this condition, there exist at most $q$ "large" coefficients and the $\ell_1$
norm of the "small" coefficients is no greater than $\eta_1$. Thus, if $q$ is of smaller
order than $p$ and $\eta_1$ is small, then the high-dimensional full model $X\beta$ with
$p$ coefficients can be approximated by a much lower-dimensional submodel
with $q$ coefficients so that model selection makes sense. Compared with the
typical assumption

(2.5)    $|A_\beta| = q, \qquad A_\beta \equiv \{j : \beta_j \neq 0\}$

for model selection, (2.4) is mathematically weaker and much more realistic
since it specifies a connected set in the parameter space $\mathbb{R}^p$ of $\beta$. Let $(j)$ be
the orderings giving $|\beta_{(1)}| \ge \cdots \ge |\beta_{(p)}|$. Another way of stating (2.4) is

(2.6)    $\sum_{j=q+1}^{p} |\beta_{(j)}| \le \eta_1, \qquad A_0 \equiv \{(q+1), \ldots, (p)\}.$

What should be the goal of model selection under the sparsity condition
(2.4)? Unlike the usual case of (2.5), condition (2.4) allows potentially many
small coefficients so that it is no longer reasonable to select exactly all variables
with nonzero coefficients. Instead, a sensible goal is to select a sparse
model which fits the mean vector $X\beta$ well and thus includes most (all) variables
with (very) large $|\beta_j|$. Under the sparsity assumption (2.4), a natural
definition of the sparsity of the selected model is $\hat q = O(q)$, where

(2.7)    $\hat q \equiv \hat q(\lambda) \equiv |\hat A| = \#\{j : \hat\beta_j \neq 0\}.$

The selected model fits the mean $X\beta$ well if its bias

(2.8)    $\tilde B \equiv \tilde B(\lambda) \equiv \|(I - \hat P)X\beta\|$

is small, where $\hat P$ is the projection from $\mathbb{R}^n$ to the linear span of the set of
selected variables $x_j$ and $I \equiv I_{n \times n}$ is the $n \times n$ identity matrix. Since the bias
$\tilde B$ is defined as the length of the difference between $X\beta$ and its projection
to the image of $\hat P$, $\tilde B^2$ is the sum of squares of the part of the mean vector
not explained by the selected model. To measure the large coefficients for
variables missing in the selected model, we define

(2.9)    $\zeta_\alpha \equiv \zeta_\alpha(\lambda) \equiv \Big(\sum_{j \notin A_0} |\beta_j|^\alpha I\{\hat\beta_j = 0\}\Big)^{1/\alpha}, \qquad 0 \le \alpha \le \infty.$

Under (2.6), $\zeta_0$ is the number of the $q$ largest $|\beta_j|$ not selected, $\zeta_2$ is the
Euclidean length of these missing large coefficients and $\zeta_\infty$ is their maximum.
What should be the correct order of $\tilde B$ and $\zeta_\alpha$? Example 1 below indicates
that under the conditions we impose, the following three quantities, or the
maximum of the three, are reasonable benchmarks for $\tilde B^2$ and $n\zeta_2^2$:

(2.10)    $\lambda\eta_1, \qquad \eta_2^2, \qquad \frac{q\lambda^2}{n},$

where $\eta_2 \equiv \max_{A \subset A_0} \|\sum_{j \in A} \beta_j x_j\| \le \max_{j \le p} \|x_j\|\,\eta_1$.

Example 1. Suppose we have an orthonormal design with $X'X/n = I_p$
and i.i.d. normal error $\varepsilon \sim N(0, I_n)$. Then, (2.2) is the soft-threshold estimator
[Donoho and Johnstone (1994)] with threshold level $\lambda/n$ for the individual
coefficients: $\hat\beta_j = \mathrm{sgn}(z_j)(|z_j| - \lambda/n)^{+}$, with $z_j \equiv x_j'y/n \sim N(\beta_j, 1/n)$
being the least-squares estimator of $\beta_j$. If $|\beta_j| = \lambda/n$ for $j = 1, \ldots, q + \eta_1 n/\lambda$
and $\lambda/\sqrt{n} \to \infty$, then $P\{\hat\beta_j = 0\} \approx 1/2$ so that
$\tilde B^2 \approx 2^{-1}(q + \eta_1 n/\lambda)\,n(\lambda/n)^2 = 2^{-1}(q\lambda^2/n + \eta_1\lambda)$.
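
As a minimal sketch of Example 1, assuming the orthonormal design $X'X/n = I_p$ and that the least-squares estimates $z_j = x_j'y/n$ are already available, the soft-threshold form of (2.2) and the selected model (2.3) can be computed as follows. The function names are illustrative and not part of the original development.

```python
def soft_threshold(z, t):
    """Return sgn(z) * max(|z| - t, 0), the soft-threshold rule."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_orthonormal(z, lam, n):
    """LASSO solution of (2.2) when X'X/n = I_p (assumed).

    z   : list of least-squares estimates z_j = x_j'y / n
    lam : penalty level lambda
    n   : sample size, so the threshold is lam / n
    """
    return [soft_threshold(zj, lam / n) for zj in z]


def selected_model(beta_hat):
    """Indices j with beta_hat_j != 0, as in (2.3) (0-based here)."""
    return [j for j, b in enumerate(beta_hat) if b != 0.0]
```

For instance, with $\lambda = 10$ and $n = 100$ the threshold is $\lambda/n = 0.1$, so an estimate $z_j = -0.05$ is set to zero while $z_j = 0.5$ is shrunk to $0.4$.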

In this example, we observe that the order of $\tilde B^2$ cannot be smaller than
the first and third quantities in (2.10), while the second quantity $\eta_2^2$ is a
natural choice of $\tilde B^2$ as the maximum mean effect of variables with small
coefficients. In the proof of Theorem 1 in Section 5 (Remark 8), we show that
$\sqrt{n}\,\zeta_2$ is of order no greater than $\tilde B + \eta_2$. Thus, we say that the LASSO is
rate-consistent in model selection if, for a suitable $\alpha$ (e.g., $\alpha = 2$ or $\alpha = \infty$),

(2.11)    $\hat q = O(q), \qquad \tilde B = O_P(B), \qquad \sqrt{n}\,\zeta_\alpha = O(B),$

with the possibility of $\tilde B = O(\eta_2)$ and $\zeta_\alpha = 0$ under stronger conditions,
where $B \equiv \max(\sqrt{\eta_1\lambda},\ \eta_2,\ \sqrt{q\lambda^2/n})$.

As we mentioned earlier, the main result of this paper proves the rate-consistency
of the LASSO under (2.4) and a sparse Riesz condition on $X$.
The sparse Riesz condition controls the range of eigenvalues of covariate
matrices of subsets of a fixed number of design vectors $x_j$. For $A \subset \{1, \ldots, p\}$,
define

(2.12)    $X_A \equiv (x_j, j \in A), \qquad \Sigma_A \equiv X_A'X_A/n.$

The design matrix $X$ satisfies the sparse Riesz condition (SRC) with rank
$q^*$ and spectrum bounds $0 < c_* < c^* < \infty$ if

(2.13)    $c_* \le \frac{\|X_A v\|^2}{n\|v\|^2} \le c^* \qquad \forall A \text{ with } |A| = q^* \text{ and } v \in \mathbb{R}^{q^*}.$

Since $\|X_A v\|^2/n = v'\Sigma_A v$, all the eigenvalues of $\Sigma_A$ are inside the interval
$[c_*, c^*]$ under (2.13) when the size of $A$ is no greater than $q^*$. While the
Riesz condition asserts the equivalence of a norm $\|\sum_j v_j\xi_j\|$ and the $\ell_2$
norm $(\sum_j v_j^2)^{1/2}$ in an entire (infinite-dimensional) linear space with basis
$\{\xi_1, \xi_2, \ldots\}$, the SRC provides the equivalence of the norm $\|\Sigma_A^{1/2}v\|$ and the
$\ell_2$ norm $\|v\|$ only in subspaces of a fixed dimension in a fixed coordinate
system. The quantities $c_*$ and $c^*$ have been considered as sparse minimum
and maximum eigenvalues [Meinshausen and Yu (2006), Donoho (2006)].
We call (2.13) the sparse Riesz condition due to its close connection to the
Riesz condition as discussed above and in Section 4.2.
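
For very small ranks, the sparse eigenvalues in (2.13) can be computed by enumeration. The sketch below is an illustration only: it assumes the design is stored as a list of rows and handles the case of rank $q^* = 2$, where the eigenvalues of each $2 \times 2$ matrix $\Sigma_A$ have a closed form. The helper names are hypothetical, and the enumeration over all pairs is not intended for large $p$.

```python
from itertools import combinations
from math import sqrt


def gram_entry(X, j, k):
    """Entry (j, k) of X'X / n for a design given as a list of n rows."""
    n = len(X)
    return sum(row[j] * row[k] for row in X) / n


def sparse_eigen_bounds_rank2(X):
    """Smallest and largest eigenvalues of Sigma_A over all |A| = 2.

    Uses the closed form for a symmetric 2 x 2 matrix [[a, b], [b, d]]:
    eigenvalues (a + d)/2 -/+ sqrt(((a - d)/2)^2 + b^2).
    """
    p = len(X[0])
    c_low, c_up = float("inf"), 0.0
    for j, k in combinations(range(p), 2):
        a = gram_entry(X, j, j)
        b = gram_entry(X, j, k)
        d = gram_entry(X, k, k)
        mid = (a + d) / 2.0
        rad = sqrt(((a - d) / 2.0) ** 2 + b * b)
        c_low = min(c_low, mid - rad)
        c_up = max(c_up, mid + rad)
    return c_low, c_up
```

For two standardized columns with correlation $\rho$, the two eigenvalues are $1 - |\rho|$ and $1 + |\rho|$, so the returned pair directly reflects the largest pairwise correlation in the design.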

We prove the rate consistency (2.11) for the LASSO under the sparsity
(2.4) and SRC (2.13) conditions if they are configured in certain ways between
themselves and in relation to the penalty level $\lambda$. These relationships
are expressed through the following ratios:

(2.14)    $r_1 \equiv r_1(\lambda) \equiv \Big(\frac{c^*\eta_1 n}{q\lambda}\Big)^{1/2}, \qquad r_2 \equiv r_2(\lambda) \equiv \Big(\frac{c^*\eta_2^2 n}{q\lambda^2}\Big)^{1/2}, \qquad C \equiv \frac{c^*}{c_*},$

where $\{q, \eta_1, \eta_2, c_*, c^*\}$ are as in (2.4), (2.10) and (2.13). The quantities in
(2.14) are invariant under scale changes
$\{X, \varepsilon, \eta_2, \sqrt{c_*}, \sqrt{c^*}, \sqrt{\lambda}\} \to \{X, \varepsilon, \eta_2, \sqrt{c_*}, \sqrt{c^*}, \sqrt{\lambda}\}/\sigma$
and $\{\varepsilon, \beta, \eta_1, \eta_2, \lambda\} \to \{\varepsilon, \beta, \eta_1, \eta_2, \lambda\}/\sigma$. Up to the factor $c^*$
for scale adjustment, $r_1^2$ and $r_2^2$ are the ratios of the first two benchmark
quantities to the third in (2.10). In terms of these scale invariant quantities,
we explicitly express in our theorem the $O(1)$ in (2.11) as

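(2.15)    $M_1^* \equiv M_1^*(\lambda) \equiv 2 + 4r_1^2 + 4\sqrt{C}\,r_2 + 4C,$

(2.16)    $M_2^* \equiv M_2^*(\lambda) \equiv \frac{8}{3}\Big\{\frac{1}{4} + r_1^2 + r_2\sqrt{2C}\,(1 + \sqrt{C}) + C\Big(\frac{1}{2} + \frac{4}{3}C\Big)\Big\}$

and

(2.17)    $M_3^* \equiv M_3^*(\lambda) \equiv \frac{8}{3}\Big\{\frac{1}{4} + r_1^2 + r_2\sqrt{C}\,(1 + 2\sqrt{1 + C}) + \frac{3r_2^2}{4} + C\Big(\frac{7}{6} + \frac{2}{3}C\Big)\Big\}.$

Note that the quantities $r_j$ and $M_k^*$ in (2.14)-(2.17) are all decreasing in $\lambda$.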
We define a lower bound for the penalty level as

(2.18)    $\lambda_* \equiv \inf\{\lambda : M_1^*(\lambda)q + 1 \le q^*\}, \qquad \inf\emptyset \equiv \infty.$

Let $\sigma \equiv (E\|\varepsilon\|^2/n)^{1/2}$. With the $\lambda_*$ in (2.18) and $c^*$ in (2.13), we consider
the LASSO path for

(2.19)    $\lambda \ge \max(\lambda_*, \lambda_{n,p}), \qquad \lambda_{n,p} \equiv 2\sigma\sqrt{2(1 + c_0)c^* n\log(p \vee a_n)},$

with $c_0 \ge 0$ and $a_n \ge 0$ satisfying $p/(p \vee a_n)^{1+c_0} \approx 0$. For large $p$, the lower
bound here is allowed to be of the order $\lambda_{n,p} \sim \sqrt{n\log p}$ with $a_n = 0$. For
example, $\lambda_* \le \lambda_{n,p}$ if (2.13) holds for $q^* \ge (6 + 4\sqrt{C} + 4C)q + 1$,
$\eta_1 \le q\lambda_{n,p}/(nc^*)$ and $\eta_2^2 \le q\lambda_{n,p}^2/(nc^*)$, up to $r_1 = r_2 = 1$ in (2.14). For fixed
$p$, $a_n \to \infty$ is required. For i.i.d. normal errors and large $p$, the false discovery
increases dramatically after the LASSO path enters the region
$\lambda < \sigma\sqrt{2n\log p}$, at least in the orthonormal case.
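
The quantities entering (2.18) and (2.19) are simple to evaluate numerically. The following sketch assumes the ratios are defined through $r_1^2 q = c^*\eta_1 n/\lambda$ and $r_2^2 q = c^*\eta_2^2 n/\lambda^2$, as stated in Remark 1, takes $M_1^*$ from (2.15), and treats all inputs ($\sigma$, $c_0$, $c_*$, $c^*$, $q$, $q^*$, $\eta_1$, $\eta_2$) as known, although in practice most of them are not. Function and argument names are ours.

```python
from math import log, sqrt


def m1_star(r1, r2, C):
    """M_1^* = 2 + 4 r1^2 + 4 sqrt(C) r2 + 4 C, as in (2.15)."""
    return 2.0 + 4.0 * r1 ** 2 + 4.0 * sqrt(C) * r2 + 4.0 * C


def ratios(lam, q, eta1, eta2, c_low, c_up, n):
    """Return (r1, r2, C), using r1^2 q = c^* eta1 n / lam and
    r2^2 q = c^* eta2^2 n / lam^2 (assumes q >= 1 and lam > 0)."""
    r1 = sqrt(c_up * eta1 * n / (q * lam))
    r2 = sqrt(c_up * eta2 ** 2 * n / (q * lam ** 2))
    return r1, r2, c_up / c_low


def lambda_np(sigma, c0, c_up, n, p, a_n=0.0):
    """Penalty level 2 sigma sqrt(2 (1 + c0) c^* n log(p v a_n)) from (2.19)."""
    return 2.0 * sigma * sqrt(2.0 * (1.0 + c0) * c_up * n * log(max(p, a_n)))


def rank_condition_holds(lam, q, eta1, eta2, c_low, c_up, n, q_star):
    """Check M_1^*(lam) q + 1 <= q^*, the constraint defining (2.18)."""
    r1, r2, C = ratios(lam, q, eta1, eta2, c_low, c_up, n)
    return m1_star(r1, r2, C) * q + 1.0 <= q_star
```

Since $r_1$ and $r_2$ decrease in $\lambda$, a value of $\lambda$ that passes `rank_condition_holds` stays admissible when $\lambda$ is increased, which is why (2.18) can be stated as an infimum.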

Theorem 1. Let $\hat q(\lambda)$, $\tilde B(\lambda)$ and $\zeta_2(\lambda)$ be as in (2.7), (2.8) and (2.9),
respectively, for the model $\hat A(\lambda)$ selected by the LASSO with (2.2) and (2.3).
Let $M_j^*$ be as in (2.15), (2.16) and (2.17). Suppose $\varepsilon \sim N(0, \sigma^2 I)$, $q \ge 1$,
and the sparsity (2.4) and sparse Riesz (2.13) conditions hold. There then
exists a set $\Omega_0$ in the sample space of $(X, \varepsilon/\sigma)$, depending on $\{X\beta, c_0, a_n\}$
only, such that

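(2.20)    $P\{(X, \varepsilon/\sigma) \in \Omega_0\} \ge 2 - \exp\Big(\frac{2p}{(p \vee a_n)^{1+c_0}}\Big) - \frac{2}{(p \vee a_n)^{1+c_0}} \approx 1$

and the following assertions hold in the event $(X, \varepsilon/\sigma) \in \Omega_0$ for all $\lambda$ satisfying (2.19):

(2.21)    $\hat q(\lambda) \le \tilde q(\lambda) \equiv \#\{j : \hat\beta_j(\lambda) \neq 0 \text{ or } j \notin A_0\} \le M_1^*(\lambda)q,$

(2.22)    $\tilde B^2(\lambda) = \|(I - \hat P(\lambda))X\beta\|^2 \le M_2^*(\lambda)\frac{q\lambda^2}{c^* n},$

with $\hat P(\lambda)$ being the projection to the span of the selected design vectors
$\{x_j, j \in \hat A(\lambda)\}$ and

(2.23)    $\zeta_2^2(\lambda) = \sum_{j \notin A_0} |\beta_j|^2 I\{\hat\beta_j(\lambda) = 0\} \le M_3^*(\lambda)\frac{q\lambda^2}{c^* c_* n^2}.$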

Remark 1. The condition $q \ge 1$ is not essential since it is only used to
express quantities in Theorem 1 and its proof in terms of ratios in (2.14).
Thus, (2.21), (2.22) and (2.23) are still valid for $q = 0$ if we use $r_1^2 q = c^*\eta_1 n/\lambda$
and $r_2^2 q = c^*\eta_2^2 n/\lambda^2$ to recover $M_j^* q$ from (2.15), (2.16) and (2.17), resulting
in

$\hat q(\lambda) \le 4c^*\eta_1 n/\lambda, \qquad \tilde B^2(\lambda) \le \frac{8}{3}\eta_1\lambda, \qquad \zeta_2^2 = 0.$

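Remark 2. For $\eta_1 = 0$ in (2.6), we have $r_1 = r_2 = 0$ and

(2.24)    $M_1^* = 2 + 4C, \qquad M_2^* = \frac{2}{3} + \frac{4}{3}C + \frac{32}{9}C^2, \qquad M_3^* = \frac{2}{3} + \frac{28}{9}C + \frac{16}{9}C^2$

all depend only on $C \equiv c^*/c_*$ in (2.14). In this case, (2.18) gives $\lambda_* = 0$
for $(2 + 4C)q + 1 \le q^*$ and $\lambda_* = \infty$ otherwise. Thus, Theorem 1 requires
$(2 + 4C)q + 1 \le q^*$ in (2.4) and (2.13).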

Remark 3. The conclusions of Theorem 1 are valid for the LASSO
path for all $\lambda \ge \max(\lambda_*, \lambda_{n,p})$ in the same event $(X, \varepsilon/\sigma) \in \Omega_0$. This allows
data-driven selection of $\lambda$, for example, cross-validation based on prediction
error. However, the theoretical justification of such a choice of $\lambda$ is unclear for
model-selection purposes. Theorem 1 and simple calculation for orthonormal
designs indicate that $\lambda_{n,p}$ is a good choice for model selection when $\lambda_{n,p} \ge \lambda_*$,
provided we have some idea about the unknown $q$ and "known" $\{c_*, c^*, q^*\}$.

Theorem 1 is proved in Section 5. The following result is an immediate
consequence of it.

Theorem 2. Suppose the conditions of Theorem 1 hold. Then, all variables
with $\beta_j^2 > M_3^*(\lambda)q\lambda^2/(c^* c_* n^2)$ are selected with $j \in \hat A(\lambda)$, provided
$(X, \varepsilon/\sigma) \in \Omega_0$ and $\lambda$ is in the interval (2.19). Consequently, if
$\beta_j^2 > M_3^*(\lambda)q\lambda^2/(c^* c_* n^2)$ for all $j \notin A_0$, then, for all $\alpha > 0$,

(2.25)    $P\{A_0^c \subset \hat A,\ \tilde B(\lambda) \le \eta_2 \text{ and } \zeta_\alpha(\lambda) = 0\} \ge 2 - \exp\Big(\frac{2p}{(p \vee a_n)^{1+c_0}}\Big) - \frac{2}{(p \vee a_n)^{1+c_0}} \approx 1.$

Theorems 1 and 2 provide sufficient conditions under which the LASSO 
is rate-consistent in sparsity and bias in the sense of (2.11). They assert that,
with large probability, the LASSO selects a model with the correct order of
dimension. Moreover, with large probability, the bias of the selected model
is the smallest possible $\eta_2$ in the best scenario when all the large coefficients
are above an explicit threshold level, and in the worst scenario, the bias is 
of the same order as what would be expected in the much simpler case of 
orthonormal design. Furthermore, with large probability, all variables with 
coefficients above the threshold level are selected, regardless of the values of 
the other coefficients. The implications of Theorem 1 on the properties of 
the LASSO estimator are discussed in Section 3. 

In Theorems 1 and 2, conditions are imposed jointly on the design $X$ and
the unknown coefficients $\beta$. Since $X$ is observable, we may think of these
conditions in the following way. We first impose the SRC (2.13) on $X$. Given
the configuration $\{q^*, c_*, c^*\}$ of the SRC and thus $C \equiv c^*/c_*$, (2.18) requires
that $\{q, r_1, r_2\}$ satisfy $(2 + 4r_1^2 + 4\sqrt{C}\,r_2 + 4C)q + 1 \le q^*$. Given $\{q, r_1, r_2\}$
and the penalty level $\lambda$, the condition on $\beta$ becomes

$|A_0^c| \le q, \qquad \eta_1 \le \frac{r_1^2 q\lambda}{c^* n}, \qquad \eta_2^2 \le \frac{r_2^2 q\lambda^2}{c^* n}.$

Since Theorems 1 and 2 are valid for any fixed sample (with the exception
of the "$\approx 1$" parts), $q^*$, $c_*$, $c^*$, $r_1$ and $r_2$ are all allowed to depend on $n$, but
they could also be considered as fixed.

The constant factors $M_j^*$ in Theorem 1 are not sharp since crude bounds
(e.g., Cauchy-Schwarz) are used several times in the proof. However, Theorem
1 is valid for any fixed $(n, p)$ with the specified configurations of the
sparsity and sparse Riesz conditions. Thus, it is necessarily invariant under
the scale transformations $(X, \varepsilon) \to (X, \varepsilon)/\sigma$ and $(\beta', \varepsilon') \to (\beta', \varepsilon')/\sigma$.

The SRC (2.13) is studied in Section 4 for both deterministic and random
covariates. Under the Riesz condition on an infinite sequence of Gaussian
covariates, we prove that (2.13) holds with fixed $0 < c_* < c^* < \infty$ and
$q^* = a_0 n/\{1 \vee \log(p/n)\}$ with large probability as $(n, p) \to (\infty, \infty)$ (cf. Remark
6). This allows the application of Theorem 1 with $p$ as large as $\exp(an)$ for a
small fixed $a > 0$. Section 6 contains additional discussion of our and related
results after we study the LASSO estimation and SRC and prove Theorem
1.



3. The LASSO estimation. Here, we describe implications of Theorem 1
for the estimation properties of the LASSO. For simplicity, we confine this
discussion to the special case where $c_*$, $c^*$, $r_1$, $r_2$, $c_0$ and $\sigma$ are fixed and
$\lambda/\sqrt{n} \ge 2\sigma\sqrt{2(1 + c_0)c^*\log p} \to \infty$. In this case, $M_k^*$ are fixed constants in
(2.15), (2.16) and (2.17), and the required configurations for (2.4), (2.13)
and (2.19) in Theorem 1 become

(3.1)    $M_1^* q + 1 \le q^*, \qquad \eta_1 \le \Big(\frac{r_1^2}{c^*}\Big)\frac{q\lambda}{n}, \qquad \eta_2^2 \le \Big(\frac{r_2^2}{c^*}\Big)\frac{q\lambda^2}{n}.$

Of course, $p$, $q$ and $q^*$ are all allowed to depend on $n$: for example,
$n > q^* > q \to \infty$.

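Let $A_1 \equiv \{j : \hat\beta_j(\lambda) \neq 0 \text{ or } j \notin A_0\}$. Set $X_1 \equiv X_{A_1}$ and $\Sigma_{11} \equiv \Sigma_{A_1}$ as in
(2.12). Define $b_1 \equiv (b_j, j \in A_1)'$ for all $b \in \mathbb{R}^p$. Consider the event
$(X, \varepsilon/\sigma) \in \Omega_0$ in Theorem 1, in which $|A_1| \le M_1^* q$. Since $\Sigma_{11} \ge c_*$ by the SRC (2.13),
the vector $v_1 \equiv X_1(\hat\beta_1 - \beta_1)$ satisfies

(3.2)    $\|v_1\|^2 = n\|\Sigma_{11}^{1/2}(\hat\beta_1 - \beta_1)\|^2 \ge c_* n\|\hat\beta_1 - \beta_1\|^2.$

The inner product of $\hat\beta_1 - \beta_1$ and the gradient $g_1 \equiv X_1'(y - X\hat\beta)$ is

$(\hat\beta_1 - \beta_1)'g_1 = v_1'(y - X_1\hat\beta_1) = v_1'(X\beta - X_1\beta_1 + \varepsilon) - \|v_1\|^2.$

Since $\|g_1\|_\infty \le \lambda$, and $\|X\beta - X_1\beta_1\| \le \eta_2$,

(3.3)    $\|v_1\| \le \|X\beta - X_1\beta_1 + P_1\varepsilon\| + n^{-1/2}\|\Sigma_{11}^{-1/2}g_1\| \le \eta_2 + \|P_1\varepsilon\| + \lambda\sqrt{|A_1|/(nc_*)},$

where $P_1 \equiv X_1\Sigma_{11}^{-1}X_1'/n$ is the projection to the range of $X_1$. Since
$\mathrm{rank}(P_1) = |A_1| \le M_1^* q$, we are able to show that $\|P_1\varepsilon\|$ is of the order $\sigma\sqrt{q\log p}$ under
the normality assumption. Thus, (3.2) and (3.3) lead to Theorem 3 below.
The inequality (2.21) plays a crucial role here since it controls $|A_1|$ and then
allows the application of the SRC.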

Theorem 3. Let $c_*$, $c^*$, $r_1$, $r_2$, $c_0$ and $\sigma$ be fixed and $1 \le q \le p \to \infty$. Let
$\lambda = 2\sigma\sqrt{2(1 + c_0')c^* n\log p}$ with a fixed $c_0' \ge c_0$ and $\Omega_0$ be as in Theorem
1. Suppose the conditions of Theorem 1 hold with configurations satisfying
(3.1). There then exist constants $M_k^*$ depending only on $c_*$, $c^*$, $r_1$, $r_2$ and $c_0'$
and a set $\tilde\Omega_q$ in the sample space of $(X, \varepsilon/\sigma)$ depending only on $q$ such that

and the following assertions hold in the event $(X, \varepsilon/\sigma) \in \Omega_0 \cap \tilde\Omega_q$:

(3.5)    $\|X(\hat\beta - \beta)\| \le M_4^*\sigma\sqrt{q\log p}$

and, for all $\alpha \ge 1$,

(3.6)    $\|\hat\beta - \beta\|_\alpha \equiv \Big(\sum_{j=1}^{p}|\hat\beta_j - \beta_j|^\alpha\Big)^{1/\alpha} \le M_5^*\sigma q^{1/(\alpha \wedge 2)}\sqrt{(\log p)/n}.$

Remark 4. The convergence rates in (3.5) and (3.6) are sharp for the
LASSO under the given conditions since the convergence rate for (3.6) is
$q^{1/\alpha}(\lambda/n + \sigma/\sqrt{n})$, $1 \le \alpha \le 2$, for orthogonal designs and the bias for a single
$\hat\beta_j$ could be of the order $\sqrt{q(\log p)/n}$, even under the strong irrepresentable
condition. Moreover, by Foster and George (1994), the risk inflation factor
$\sqrt{\log p}$ is optimal for (3.5) and (3.6) with $\alpha = 2$. We discuss related work in
Section 6 after we study the SRC and prove Theorem 1.

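Proof of Theorem 3. Define $P_A \equiv X_A\Sigma_A^{-1}X_A'/n$ with the notation
in (2.12) and

$\tilde\Omega_q \equiv \Big\{\max_{q < |A| \le p}\frac{\|P_A\varepsilon\|^2}{\sigma^2|A|} \le 1 + 4\log p\Big\}.$

For deterministic $A$ with $\mathrm{rank}(X_A) = m$, $\|P_A\varepsilon\|^2/\sigma^2 \sim \chi_m^2$ so that

$P\{\|P_A\varepsilon\|^2/\sigma^2 \ge m(1 + 4\log p)\} \le \{p^{-4}(1 + 4\log p)\}^{m/2},$

by the standard large deviation inequality. It follows that

$1 - P\{\tilde\Omega_q\} \le \sum_{m=q+1}^{p}\binom{p}{m}\{p^{-4}(1 + 4\log p)\}^{m/2} \le \Big(\frac{1}{p^2} + \frac{4\log p}{p^2}\Big)^{(q+1)/2},$

due to the facts that $\binom{p}{m} \le p^m/m!$ and $1 + 4\log p \le p^2$. Since $q + 1 \le q^*$,
the arguments for (3.2) and (3.3) are still valid if we require $|A_1| \ge q + 1$
(making $A_1$ larger). Thus, (3.5) follows from (2.21) and (3.3), due to
$\|P_1\varepsilon\| \le 2\sigma\sqrt{|A_1|\log p}$ in $\tilde\Omega_q$. Similarly, by both (3.2) and (3.3), we have, in $\Omega_0 \cap \tilde\Omega_q$,

$\Big(\sum_{j \in A_1}|\hat\beta_j - \beta_j|^2\Big)^{1/2} \le O(1)\sigma\sqrt{|A_1|(\log p)/n}$

uniformly. Thus, since $A_1^c \subseteq A_0$, (3.6) follows from

(3.7)    $\Big(\sum_{j \in A_0}|\beta_j|^\alpha\Big)^{1/\alpha} \le O(1)\sigma q^{1/(\alpha \wedge 2)}\sqrt{(\log p)/n}$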

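for $\alpha = 1, 2$ and $\alpha = \infty$. For $\alpha = 1$, (3.7) follows from the second inequality
of (3.1). For $\alpha = 2$, $\#\{j \in A_0 : |\beta_j| > \lambda/n\} = O(q)$, by (3.7) for $\alpha = 1$, so that,
by the SRC (2.13) and the third inequality of (3.1),

$\sum_{j \in A_0}\beta_j^2 I\{|\beta_j| > \lambda/n\} \le O(1/n)\Big\|\sum_{j \in A_0}\beta_j x_j I\{|\beta_j| > \lambda/n\}\Big\|^2 \le O(\eta_2^2/n) = O(q\lambda^2/n^2).$

Thus, (3.7) for $\alpha = 2$ follows from $\alpha = 1$. Finally, (3.7) for $\alpha = \infty$ follows
from $|\beta_j| \le \|\beta_j x_j\|/\sqrt{nc_*} \le \eta_2/\sqrt{nc_*}$. □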



4. The sparse Riesz condition. In this section, we provide sufficient con- 
ditions for the sparse Riesz condition. We divide the section into two subsec- 
tions respectively for deterministic and random design matrices X. In the 
case of random design, the rows of X are assumed to be i.i.d. vectors, but 
the entries within a row are allowed to be dependent. 

We consider the sparse Riesz condition (2.13) and its general version

(4.1)    $c_*(m) \equiv \min_{|A| = m}\min_{\|v\| = 1}\|X_A v\|^2/n, \qquad c^*(m) \equiv \max_{|A| = m}\max_{\|v\| = 1}\|X_A v\|^2/n,$

for ranks $0 \le m \le p$, with the convention that $c_*(0) = c^*(0) \equiv \sqrt{c_*(1)c^*(1)}$.
This includes (2.13) with $c_* = c_*(q^*)$ and $c^* = c^*(q^*)$. As we mentioned earlier,
(4.1) reduces to the requirement that all of the eigenvalues of $\Sigma_A$ in
(2.12) lie in the interval $[c_*(m), c^*(m)]$ when $|A| \le m$. If $x_j$ are standardized
with $\|x_j\|^2/n = 1$, then $c_*(1) = c^*(1) = 1$. In general, $c_*(1) \le \|x_j\|^2/n \le c^*(1)$.
It is clear that $c_*(m)$ is decreasing in $m$ with $c_*(n + 1) = 0$, $c^*(m)$ is increasing
in $m$ and the Cauchy-Schwarz inequality gives the subadditivity
$c^*(m_1 + m_2) \le c^*(m_1) + c^*(m_2)$.

4.1. Deterministic design matrices. Proposition 1 below provides a simple
sufficient condition for (2.13). It is actually an $\ell_\alpha$-version of Geršgorin's
theorem.

Proposition 1. Suppose that $X$ is standardized with $\|x_j\|^2/n = 1$. Let
$\rho_{jk} = x_j'x_k/n$ be the correlation. If

(4.2)    $\max_{|A| = q^*}\inf_{\alpha \ge 1}\Big\{\sum_{j \in A}\Big(\sum_{k \in A, k \neq j}|\rho_{jk}|^{\alpha/(\alpha - 1)}\Big)^{\alpha - 1}\Big\}^{1/\alpha} \le \delta < 1,$

then the sparse Riesz condition (2.13) holds with rank $q^*$ and spectrum
bounds $c_* = 1 - \delta$ and $c^* = 1 + \delta$. In particular, (2.13) holds with $c_* = 1 - \delta$
and $c^* = 1 + \delta$ if

(4.3)    $\max_{1 \le j < k \le p}|\rho_{jk}| \le \frac{\delta}{q^* - 1}, \qquad \delta < 1.$

Remark 5. If $\delta = 1/3$, then $C = c^*/c_* = 2$ and Theorem 1 is applicable
for (1.1) if $10q + 1 \le q^*$ and $\eta_1 = 0$ in (2.4).
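
Condition (4.3) is straightforward to check for a given standardized design. As an illustrative sketch (the function names are hypothetical, and the design is assumed to be a list of rows with columns scaled so that $\|x_j\|^2/n = 1$), the code below computes $\delta = (q^* - 1)\max_{j<k}|\rho_{jk}|$, returns the spectrum bounds $1 \mp \delta$ when $\delta < 1$, and reports the largest $q$ compatible with $(2 + 4C)q + 1 \le q^*$ in the case $\eta_1 = 0$.

```python
def max_abs_correlation(X):
    """max over j < k of |x_j'x_k / n|, for columns scaled to ||x_j||^2 = n."""
    n, p = len(X), len(X[0])
    best = 0.0
    for j in range(p):
        for k in range(j + 1, p):
            rho = sum(row[j] * row[k] for row in X) / n
            best = max(best, abs(rho))
    return best


def src_from_coherence(X, q_star):
    """Return (1 - delta, 1 + delta) with delta = (q_star - 1) * max|rho_jk|
    when delta < 1, as in (4.3); return None if the condition fails."""
    delta = (q_star - 1) * max_abs_correlation(X)
    if delta >= 1.0:
        return None
    return 1.0 - delta, 1.0 + delta


def max_sparsity(c_low, c_up, q_star):
    """Largest integer q with (2 + 4C) q + 1 <= q_star, where C = c_up / c_low
    (the requirement of Remark 2 when eta_1 = 0)."""
    C = c_up / c_low
    return int((q_star - 1) // (2.0 + 4.0 * C))
```

With $C = 2$, as in Remark 5, `max_sparsity` reduces to the largest $q$ satisfying $10q + 1 \le q^*$. The bound from (4.3) is conservative, since it uses only the largest pairwise correlation.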

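Proof of Proposition 1. Let $\Sigma_A = (\rho_{jk})_{j \in A, k \in A}$ be the covariance
matrix for variables in $A$, as in (2.12). Let $|A| = q^*$ and $b = (b_1, \ldots, b_{q^*})'$ be
an eigenvector of $\Sigma_A$ with eigenvalue $\tau$. Then,

$b_j + \sum_{k \neq j, k \in A}\rho_{jk}b_k = \tau b_j$

so that, by the Hölder inequality,

$|1 - \tau|^\alpha\sum_{j \in A}|b_j|^\alpha = \sum_{j \in A}\Big|\sum_{k \neq j, k \in A}\rho_{jk}b_k\Big|^\alpha \le \sum_{j \in A}\Big(\sum_{k \neq j, k \in A}|\rho_{jk}|^{\alpha/(\alpha - 1)}\Big)^{\alpha - 1}\sum_{k \in A}|b_k|^\alpha.$

After the cancellation of $\sum_{k \in A}|b_k|^\alpha$, we find, by (4.2), that $|1 - \tau| \le \delta$. This
gives (2.13) with $c_* = 1 - \delta$ and $c^* = 1 + \delta$ as the interval $[c_*, c^*]$ contains all
eigenvalues of $\Sigma_A$ with $|A| = q^*$. If (4.3) holds, then, as $\alpha \to \infty$,

$\Big\{\sum_{j \in A}\Big(\sum_{k \in A, k \neq j}|\rho_{jk}|^{\alpha/(\alpha - 1)}\Big)^{\alpha - 1}\Big\}^{1/\alpha} \le \Big\{q^*\Big((q^* - 1)\Big(\frac{\delta}{q^* - 1}\Big)^{\alpha/(\alpha - 1)}\Big)^{\alpha - 1}\Big\}^{1/\alpha} = \delta(q^*)^{1/\alpha}(q^* - 1)^{-1/\alpha} \to \delta.$

The proof of Proposition 1 is complete. □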

4.2. Random design matrices. Suppose we would like to investigate the
linear relationships between a response variable $Y$ and infinitely many possible
covariates $\{\xi_k, k = 1, 2, \ldots\}$. Suppose that in the $n$th experiment, we
collect a sample from the dependent variable $Y$ and $p$ covariates so that we
observe $n$ independent copies $(y_i^{(n)}, x_{ij}^{(n)}, j = 1, \ldots, p^{(n)})$ of the random vector
$(Y, \xi_{k_j}, j = 1, \ldots, p)$ for certain $k_1 < \cdots < k_p$, $p = p^{(n)}$. In this case, the linear
model (1.1) becomes

(4.4)    $y_i^{(n)} = \sum_{j=1}^{p}\beta_j^{(n)}x_{ij}^{(n)} + \varepsilon_i^{(n)}.$

In what follows, the superscript $(n)$ is often omitted.

The infinite population sequence $\{\xi_k, k = 1, 2, \ldots\}$ satisfies the Riesz condition
if there exist fixed $0 < \rho_* < \rho^* < \infty$ such that

(4.5)    $\rho_*\sum_{j=1}^{\infty}b_j^2 \le E\Big(\sum_{j=1}^{\infty}b_j\xi_j\Big)^2 \le \rho^*\sum_{j=1}^{\infty}b_j^2$

for all constants $b_j$. Let $x^i \equiv (x_{i1}, \ldots, x_{ip})$ be the row vectors of
$X \equiv (x_{ij})_{n \times p} = (x_1, \ldots, x_p)$ in (4.4). Since $x^i$, $i = 1, \ldots, n$, are i.i.d. copies of
$(\xi_{k_1}, \ldots, \xi_{k_p})$, (4.5) implies that

$\rho_*\|b\|^2 \le E\sum_{i=1}^{n}\frac{(b'x^i)^2}{n} = \frac{E\|Xb\|^2}{n} \le \rho^*\|b\|^2.$

However, this does not guarantee that $0 < \kappa \le c_*(m) \le c^*(m) \le 1/\kappa$ with
large probability for all $m$. In particular, we always have $c_*(n + 1) = 0$.

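Proposition 2. Suppose that the $n$ rows of a random matrix $X_{n \times p}$ are
i.i.d. copies of a subvector $(\xi_{k_1}, \ldots, \xi_{k_p})$ of a zero-mean random sequence
$\{\xi_j, j = 1, 2, \ldots\}$ satisfying (4.5). Let $c_*(m)$ and $c^*(m)$ be as in (4.1).

(i) Suppose $\{\xi_k, k \ge 1\}$ is a Gaussian sequence. Let $\epsilon_k$, $k = 1, 2, 3, 4$,
be positive constants in $(0, 1)$ satisfying $m \le \min(p, \epsilon_1^2 n)$, $\epsilon_1 + \epsilon_2 < 1$ and
$\epsilon_3 + \epsilon_4 = \epsilon_2^2/2$. Then, for all $(m, n, p)$ satisfying $\log\binom{p}{m} \le \epsilon_3 n$,

(4.6)    $P\{\tau_*\rho_* \le c_*(m) \le c^*(m) \le \tau^*\rho^*\} \ge 1 - 2e^{-n\epsilon_4},$

where $\tau_* \equiv (1 - \epsilon_1 - \epsilon_2)^2$ and $\tau^* \equiv (1 + \epsilon_1 + \epsilon_2)^2$.

(ii) Suppose $\max_{j \le p}\|\xi_{k_j}\|_\infty \le K_n < \infty$. Then, for any $\tau_* < 1 < \tau^*$, there
exists a constant $\epsilon_0 > 0$ depending only on $\rho_*$, $\rho^*$, $\tau_*$ and $\tau^*$ such that

$P\{\tau_*\rho_* \le c_*(m) \le c^*(m) \le \tau^*\rho^*\} \to 1$

for $m \equiv m_n \le \epsilon_0 K_n^{-1}\sqrt{n/\log p}$, provided $\sqrt{n}/K_n \to \infty$.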

Remark 6. By the Stirling formula, for $p/n \to \infty$,

$m \le \epsilon_3 n/\log(p/n) \ \Rightarrow\ \log\binom{p}{m} \le (\epsilon_3 + o(1))n.$

Thus, Proposition 2(i) is applicable up to $p = e^{an}$ for some small $a > 0$.
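
The rank condition $\log\binom{p}{m} \le \epsilon_3 n$ of Proposition 2(i) can be explored numerically. The sketch below uses the log-gamma function to evaluate $\log\binom{p}{m}$ and returns the largest $m$ for which the condition holds for all ranks up to $m$. The function names and the stopping rule are our own choices for illustration.

```python
from math import lgamma, log


def log_binom(p, m):
    """Natural log of the binomial coefficient C(p, m)."""
    return lgamma(p + 1) - lgamma(m + 1) - lgamma(p - m + 1)


def largest_rank(p, n, eps3):
    """Largest m <= min(p, n) such that log C(p, k) <= eps3 * n
    for every k = 1, ..., m (returns 0 if even k = 1 fails)."""
    m = 0
    while m + 1 <= min(p, n) and log_binom(p, m + 1) <= eps3 * n:
        m += 1
    return m
```

The elementary bound $\log\binom{p}{m} \le m\log(ep/m)$ gives a quick way to see why ranks of order $n/\log(p/n)$ are admissible.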

Remark 7. Supposing $m = p$, $p/n \to \epsilon_1^2 \in (0, 1)$ and $\xi_k$ are i.i.d. $N(0, 1)$,
Geman (1980) proved $c^*(m) \to (1 + \epsilon_1)^2$ and Silverstein (1985) proved
$c_*(m) \to (1 - \epsilon_1)^2$. Silverstein's results can be directly used to prove bounds similar
to (4.6) [cf. Zhang and Huang (2006)]. We refer to Bai (1999) and Davidson
and Szarek (2001) for further discussion on random covariance matrices.

Proof of Proposition 2. (i) Let $S^{m-1}$ be the unit sphere of $\mathbb{R}^m$ and
$P_m : \mathbb{R}^p \to \mathbb{R}^m$ be $m \times p$ projection matrices taking $m$ out of $p$ coordinates
of $\mathbb{R}^p$. Define

$\tau_-(P_m) \equiv \inf_{b \in S^{m-1}}\frac{\|XP_m'b\|^2}{E\|XP_m'b\|^2}, \qquad \tau_+(P_m) \equiv \sup_{b \in S^{m-1}}\frac{\|XP_m'b\|^2}{E\|XP_m'b\|^2}.$

Since $\rho_* \le E\|XP_m'b\|^2/n \le \rho^*$, by (4.1), we have

(4.7)    $P_{m,n,p} \equiv P\{\tau_*\rho_* \le c_*(m) \le c^*(m) \le \tau^*\rho^*\} \ge P\{\tau_* \le \min_{P_m}\tau_-(P_m) \le \max_{P_m}\tau_+(P_m) \le \tau^*\}.$

For a fixed $P_m$, let $\Sigma_m$ be the $m \times m$ population covariance matrix of
the rows of $XP_m'$ and $U \equiv XP_m'\Sigma_m^{-1/2}$. Since $U$ is then an $n \times m$ matrix of
i.i.d. $N(0, 1)$ entries,

$\tau_+(P_m) = \sup_{b \in S^{m-1}}\frac{\|XP_m'b\|^2}{n\|\Sigma_m^{1/2}b\|^2} = \sup_{b \in S^{m-1}}\|Ub\|^2/n = \lambda_{\max}(W/n)$

and $\tau_-(P_m) = \lambda_{\min}(W/n)$, where $W \equiv U'U$ is an $m \times m$ matrix with the
Wishart distribution $W_m(I, n)$ [cf. Eaton (1983)]. Since $m/n \le \epsilon_1^2$, for the
prescribed $\tau_*$ and $\tau^*$, Theorem II.13 of Davidson and Szarek (2001) gives

$\max(P\{\lambda_{\min}(W/n) \le \tau_*\},\ P\{\lambda_{\max}(W/n) \ge \tau^*\}) \le e^{-n\epsilon_2^2/2}.$

Thus, since there exist a total of $\binom{p}{m}$ choices of $P_m$, by (4.7),

(4.8)    $1 - P_{m,n,p} \le \binom{p}{m}(1 - P\{\tau_* \le \lambda_{\min}(W/n) \le \lambda_{\max}(W/n) \le \tau^*\}) \le 2\binom{p}{m}e^{-n\epsilon_2^2/2} \le 2e^{-n\epsilon_4}.$

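(ii) Define $f_n(b) \equiv (\|XP_m'b\|^2/n)^{1/2}$ and $f(b) \equiv \{Ef_n^2(b)\}^{1/2}$. By (4.5),
$f^2(b)/\|b\|^2 \in [\rho_*, \rho^*]$ for all $b \neq 0$. Since both $f_n$ and $f$ are norms in $\mathbb{R}^m$,

$\Big|\frac{f_n(b + \tilde b)}{f(b + \tilde b)} - \frac{f_n(b)}{f(b)}\Big| \le \Big(\frac{f_n(\tilde b)}{f(\tilde b)} + \frac{f_n(b)}{f(b)}\Big)\frac{f(\tilde b)}{f(b + \tilde b)}.$

Let $S_{\epsilon_1}^{m-1}$ be an $\epsilon_1$-net in $S^{m-1}$ with $2\epsilon_1\sqrt{\rho^*/\rho_*} \le 1/5$. We have

$\tau_+^{1/2}(P_m) = \max_{b \in S^{m-1}}\frac{f_n(b)}{f(b)} \le \frac{5}{4}\max_{b \in S_{\epsilon_1}^{m-1}}\frac{f_n(b)}{f(b)}$

and

$\tau_-^{1/2}(P_m) \ge \min_{b \in S_{\epsilon_1}^{m-1}}\frac{f_n(b)}{f(b)} - \frac{1}{5}\tau_+^{1/2}(P_m).$

Since $f_n^2(b)/f^2(b)$ is the average of $n$ i.i.d. variables, each with mean 1 and
uniformly bounded by $mK_n^2/\rho_*$, by the Bernstein inequality, we have

$P\{|f_n^2(b)/f^2(b) - 1| > 7/25\} \le 2\exp\Big(-\frac{\epsilon_2 n}{mK_n^2}\Big)$

for certain $\epsilon_2$ depending on $\rho_*$ only. Thus, for $\tau^* = (5/4)^2(1 + 7/25) = 2$ and
$\tau_* = (\sqrt{1 - 7/25} - \sqrt{2}/5)^2 = 8/25$, we have

$1 - P_{m,n,p} \le 2\binom{p}{m}|S_{\epsilon_1}^{m-1}|\exp\Big(-\frac{\epsilon_2 n}{mK_n^2}\Big).$

Since $|S_{\epsilon_1}^{m-1}|/m! = O(1)$, $P_{m,n,p} \to 1$ for $\epsilon_2 n/(mK_n^2) \ge 2m\log p$. This proves
(ii) for the specific $\{\tau_*, \tau^*\}$. We omit the proof for the general $\{\tau_*, \tau^*\}$. □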

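5. Proof of Theorem 1. Taking the scale change $\{\varepsilon, \beta, \lambda\} \to \{\varepsilon/\sigma, \beta/\sigma, \lambda/\sigma\}$
if necessary, we assume $\varepsilon \sim N(0, I)$, without loss of generality. It follows from
the Karush-Kuhn-Tucker condition that a vector $b = (b_1, \dots, b_p)'$ is the
solution $\hat\beta$ of (2.2) if and only if

(5.1)
$$\begin{aligned}
x_j'(y - Xb) &= \mathrm{sgn}(b_j)\lambda, & |b_j| &> 0, \\
|x_j'(y - Xb)| &\le \lambda, & b_j &= 0.
\end{aligned}$$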
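
The Karush-Kuhn-Tucker characterization (5.1) can be checked numerically on a small example. The sketch below solves the penalized least squares problem $\|y - Xb\|^2/2 + \lambda\|b\|_1$ by plain coordinate descent and then measures the largest violation of (5.1). Coordinate descent is used here only as a convenient solver and is not prescribed by the paper; the names `lasso_cd` and `kkt_violation`, the simulated data and the value of $\lambda$ are all illustrative assumptions.

```python
import numpy as np

def lasso_cd(X, y, lam, sweeps=500):
    """Coordinate descent for min_b ||y - Xb||^2 / 2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)      # squared column norms ||x_j||^2
    r = y.copy()                       # current residual y - Xb
    for _ in range(sweeps):
        for j in range(p):
            z = X[:, j] @ r + col_sq[j] * b[j]          # x_j' (partial residual)
            new = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]  # soft threshold
            r += X[:, j] * (b[j] - new)                 # keep residual in sync
            b[j] = new
    return b

def kkt_violation(X, y, b, lam):
    """Largest violation of the two parts of (5.1) over all coordinates."""
    g = X.T @ (y - X @ b)
    active = b != 0
    v_active = np.abs(g[active] - np.sign(b[active]) * lam)
    v_zero = np.maximum(np.abs(g[~active]) - lam, 0.0)
    return max(v_active.max(initial=0.0), v_zero.max(initial=0.0))

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 8))                        # hypothetical design
beta = np.array([2.0, -1.5, 0, 0, 0, 0, 0, 0])          # hypothetical coefficients
y = X @ beta + rng.standard_normal(60)
b_hat = lasso_cd(X, y, lam=20.0)
print("estimate:", np.round(b_hat, 3))
print("largest KKT violation:", kkt_violation(X, y, b_hat, 20.0))
```

A violation near machine precision indicates that the computed vector satisfies both the equality for nonzero coordinates and the inequality for zero coordinates.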



This allows us to define slightly more general versions of the $\hat A$ in (2.3) and
its dimension as

(5.2) $\{j : \hat\beta_j \neq 0\} \subset A_1 \subset \{j : |x_j'(y - X\hat\beta)| = \lambda\} \cup A_0^c, \qquad q_1 = |A_1|.$

Set $A_2 = \{1, \dots, p\} \setminus A_1$, $A_3 = A_1 \setminus A_0$, $A_4 = A_1 \cap A_0$, $A_5 = A_2 \setminus A_0$ and
$A_6 = A_2 \cap A_0$. For $A_k \subset A_j$, let $Q_{kj}$ be the matrix representing the selection
of variables in $A_k$ from $A_j$, defined as $Q_{kj}\beta_j = \beta_k$, where $\beta_k = (\beta_j, j \in A_k)'$.
For example, $\beta_1' = \beta_3'Q_{31} + \beta_4'Q_{41}$ since $A_1 = A_3 \cup A_4$ and $A_3 \cap A_4 = \emptyset$.
We define matrices $\Sigma_{jk} = n^{-1}X_j'X_k$, and the projection $P_1$ from $\mathbb{R}^n$
to the span of $\{x_j, j \in A_1\}$. We apply all arithmetic and logic operations
and univariate functions to vectors componentwise. For example, $v \times |\beta| =
(v_1|\beta_1|, \dots, v_p|\beta_p|)'$. The SRC (4.1) for a general rank $m$ is used in most
parts of the proof, rather than (2.13). Table 1 summarizes the meanings of
the index sets $A_j$.

We note that $\hat q = q_1$ and $\hat P = P_1$ when we choose the smallest possible
$A_1$ in (5.2) and that $A_5 = \emptyset$ when we choose the largest possible $A_1$. In
our analysis of the LASSO, quantities related to the coefficients in the sets
$A_j$, $j = 0, 1, 2$, are often decomposed into those involving the more specific
sets $A_j$, $j = 3, 4, 5, 6$.

It follows from (5.1) that

(5.3) $s_j = X_j'(y - X\hat\beta)/\lambda \in [-1, 1], \qquad j = 1, 3, 4.$
Table 1
Sets of variables considered in the proof

                                              "Large" $|\beta_j|$   "Small" $|\beta_j|$   Quantities
                                              $j \notin A_0$        $j \in A_0$           to be bounded

$A_1$: selected $j$ and some $j \notin A_0$   $A_3$                 $A_4$                 $\hat q \le q_1 = |A_1|$
$A_2$: $j$ not in $A_1$                       $A_5$                 $A_6$                 $\|(I - \hat P)X\beta\|$


Our goal is to find upper bounds for the dimension $q_1 = |A_1|$ and the bias
terms $\|(I - P_1)X\beta\|$ and $\|\beta_5\|$ for all the $A_1$ in (5.2). By (5.1), (5.2) and
Table 1, we have $|s_4| = 1$ for each component so that $\|s_4\|^2 = |A_4|$ and $q_1 =
|A_1| = |A_3| + |A_4| \le q + \|s_4\|^2$. Our plan is to find upper bounds for the
lengths of the vectors $v_{14}$, $w_2$ and $\beta_5$, where

(5.4) $v_{1j} \equiv \dfrac{\lambda}{n^{1/2}}\Sigma_{11}^{-1/2}Q_{j1}'s_j, \qquad w_k \equiv (I - P_1)X_k\beta_k,$

for $j = 3, 4$ and $k = 2, \dots, 6$. Since $X\beta = X_1\beta_1 + X_2\beta_2$ and $(I - P_1)X_1\beta_1 = 0$,
by (5.4) and (4.1), the fact that $\|s_4\|^2 = |A_4|$ implies that

(5.5) $\|v_{14}\|^2 \ge \dfrac{\lambda^2|A_4|}{nc^*(q_1)}, \qquad \|w_2\|^2 = \|(I - P_1)X\beta\|^2.$

Thus, we proceed to find upper bounds for $\|v_{14}\|$, $\|w_2\|$ and $\|\beta_5\|$.

We divide the rest of the proof into three steps. Step 1 proves that the
quadratic $\|v_{14}\|^2 + \|w_2\|^2$ is no greater than a linear function of $\{\|v_{14}\|, \|w_2\|,
\|\beta_5\|_1, \|P_1X_2\beta_2\|\}$ with a stochastic slope. This step is crucial since the
identity and inequalities in the Karush-Kuhn-Tucker condition (5.1) must be combined
in a proper way to cancel out the cross-product term of $s_4$ and $\beta_2$. Step 2
translates the results of Step 1 into upper bounds for $q_1$, $\|w_2\|^2$ and $\|\beta_5\|^2$,
essentially with careful applications of the Cauchy-Schwarz inequality, for a
suitable level of the random slope and the prescribed penalty levels $\lambda$. The
upper bounds in Step 2 are of the same form as in the conclusions of the
theorem, but still involve $c_*(|A|)$ and $c^*(|A|)$ with random $A \subset A_1 \cup A_5$ instead
of the $c_*$ and $c^*$ specified in (2.13). Step 3 completes the proof by finding
probabilistic bounds for the random slope and by showing $|A_1 \cup A_5| \le q^*$ for
the rank $q^*$ in (2.13). We need a lemma for the interpretation of (4.1).

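Lemma 1. Let $c_*(m)$ and $c^*(m)$ be as in (4.1). Let $A_k \subset \{1, \dots, p\}$,
$X_k = (x_j, j \in A_k)$ and $\Sigma_{1k} = X_1'X_k/n$. Then,

(5.6) $\dfrac{\|v\|^2}{c^*(|A_1|)} \le \|\Sigma_{11}^{-1/2}v\|^2 \le \dfrac{\|v\|^2}{c_*(|A_1|)}, \qquad \|\beta_k\|_1^2 \le \dfrac{\|X_k\beta_k\|^2|A_k|}{nc_*(|A_k|)}$

for all $v$ of proper dimension. Furthermore, if $A_k \cap A_1 = \emptyset$, then

(5.7) $\|\beta_k\|^2 + \|\Sigma_{11}^{-1}\Sigma_{1k}\beta_k\|^2 \le \dfrac{\|(I - P_1)X_k\beta_k\|^2}{nc_*(|A_1 \cup A_k|)},$

where $P_1$ is the projection to the span of $\{x_j, j \in A_1\}$.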

Remark 8. For $A_5 = \{j \notin A_0, \hat\beta_j = 0\}$, Lemma 1 gives $\zeta_2^2 = \|\beta_5\|^2 \le
(\tilde B + \eta_2)^2/(nc_*)$, provided $|A_1 \cup A_5| \le q^*$ under the SRC (2.13).

Proof of Lemma 1. We only prove the inequality of (5.7), since the
rest of the lemma follows directly from the Cauchy-Schwarz inequality and
(4.1). Let $v = -\Sigma_{11}^{-1}\Sigma_{1k}\beta_k$. Since $(I - P_1)X_k\beta_k = X_1v + X_k\beta_k$,

$$\|(I - P_1)X_k\beta_k\|^2 = (v', \beta_k')(X_1, X_k)'(X_1, X_k)\binom{v}{\beta_k} \ge nc_*(|A_1 \cup A_k|)(\|v\|^2 + \|\beta_k\|^2).$$

The proof of Lemma 1 is complete. □
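
The two facts used in this proof can be checked numerically. The sketch below verifies the identity $(I - P_1)X_k\beta_k = X_1v + X_k\beta_k$ and the lower bound on $\|(I - P_1)X_k\beta_k\|^2$, with $c_*(|A_1 \cup A_k|)$ replaced by the smallest eigenvalue of the Gram matrix of $(X_1, X_k)$. That eigenvalue is no smaller than $c_*(|A_1 \cup A_k|)$, so the checked inequality implies (5.7). The function name `lemma1_check`, the index sets and the simulated data are illustrative assumptions.

```python
import numpy as np

def lemma1_check(X, A1, Ak, beta_k):
    """Return (identity gap, ||(I - P1) Xk beta_k||^2, n * lam_min * (||v||^2 + ||beta_k||^2))."""
    n = X.shape[0]
    X1, Xk = X[:, A1], X[:, Ak]
    S11 = X1.T @ X1 / n
    S1k = X1.T @ Xk / n
    v = -np.linalg.solve(S11, S1k @ beta_k)          # v = -Sigma_11^{-1} Sigma_1k beta_k
    P1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T)       # projection onto the span of X1
    resid = (np.eye(n) - P1) @ (Xk @ beta_k)
    identity_gap = np.linalg.norm(resid - (X1 @ v + Xk @ beta_k))
    Z = np.hstack([X1, Xk])
    lam_min = np.linalg.eigvalsh(Z.T @ Z / n)[0]     # smallest eigenvalue of the joint Gram matrix
    lower = n * lam_min * (v @ v + beta_k @ beta_k)
    return identity_gap, resid @ resid, lower

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 9))                      # hypothetical design
gap, lhs, lower = lemma1_check(X, [0, 1, 2, 3], [5, 7], rng.standard_normal(2))
print("identity gap:", gap)
print("squared norm:", lhs, ">= lower bound:", lower)
```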

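Step 1. In this step, we prove

(5.8)
$$\|v_{14}\|^2 + \|w_2\|^2 \le (\|v_{14}\|^2 + \|w_2\|^2)^{1/2}|u'\varepsilon| + (\|\beta_5\|_1 + \eta_1)\lambda + (\|v_{14}\| + \|P_1X_2\beta_2\|)\Big(\frac{\lambda^2|A_3|}{nc_*(q_1)}\Big)^{1/2},$$

where $u$ is a (random) unit vector in $\mathbb{R}^n$ defined as

(5.9)
$$u \equiv \frac{X_1\Sigma_{11}^{-1}Q_{41}'s_4\lambda/n - w_2}{\|X_1\Sigma_{11}^{-1}Q_{41}'s_4\lambda/n - w_2\|}.$$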

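Since the eigenvalues of $\Sigma_{11}$ are no smaller than $c_*(q_1)$, we assume, without
loss of generality, that $\Sigma_{11}$ is of full rank. Since $X\hat\beta = X_1\hat\beta_1$ by (5.2),
(5.3) gives $X_1'(y - X_1\hat\beta_1) = s_1\lambda$ so that

$$X_1'X_1\hat\beta_1 = X_1'y - s_1\lambda = X_1'X_1\beta_1 + X_1'X_2\beta_2 + X_1'\varepsilon - s_1\lambda.$$

This and the definition $\Sigma_{jk} = X_j'X_k/n$ yield

(5.10) $\hat\beta_1 - \beta_1 = \Sigma_{11}^{-1}\Sigma_{12}\beta_2 + \Sigma_{11}^{-1}X_1'\varepsilon/n - \Sigma_{11}^{-1}s_1\lambda/n.$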

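Inserting (5.10) into the second part of (5.1), we find that $\lambda$ is a
componentwise upper bound of the absolute value of the vector

$$\begin{aligned}
X_2'(y - X\hat\beta)
&= X_2'(X_1\beta_1 + X_2\beta_2 + \varepsilon - X_1\hat\beta_1) \\
&= n\Sigma_{21}\beta_1 + n\Sigma_{22}\beta_2 + X_2'\varepsilon
 - n\Sigma_{21}(\beta_1 + \Sigma_{11}^{-1}\Sigma_{12}\beta_2 + \Sigma_{11}^{-1}X_1'\varepsilon/n - \Sigma_{11}^{-1}s_1\lambda/n) \\
&= n(\Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12})\beta_2 + (X_2' - \Sigma_{21}\Sigma_{11}^{-1}X_1')\varepsilon + \Sigma_{21}\Sigma_{11}^{-1}s_1\lambda.
\end{aligned}$$

Since $n(\Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}) = X_2'(I - P_1)X_2$ and $X_2' - \Sigma_{21}\Sigma_{11}^{-1}X_1' = X_2'(I - P_1)$,

(5.11) $-\lambda \le X_2'(I - P_1)X_2\beta_2 + X_2'(I - P_1)\varepsilon + \Sigma_{21}\Sigma_{11}^{-1}s_1\lambda \le \lambda.$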
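
The two matrix identities behind (5.11) are easy to confirm numerically. The sketch below compares both sides of each identity for a random design; the function name `schur_gaps`, the index set and the simulated data are illustrative assumptions.

```python
import numpy as np

def schur_gaps(X, A1):
    """Largest entrywise gaps in the two identities used to pass from (5.10) to (5.11)."""
    n, p = X.shape
    A2 = [j for j in range(p) if j not in A1]
    X1, X2 = X[:, A1], X[:, A2]
    S11, S12 = X1.T @ X1 / n, X1.T @ X2 / n
    S21, S22 = S12.T, X2.T @ X2 / n
    P1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T)   # projection onto the span of X1
    M = np.eye(n) - P1
    # n (S22 - S21 S11^{-1} S12) versus X2' (I - P1) X2
    gap1 = np.abs(n * (S22 - S21 @ np.linalg.solve(S11, S12)) - X2.T @ M @ X2).max()
    # X2' - S21 S11^{-1} X1' versus X2' (I - P1)
    gap2 = np.abs((X2.T - S21 @ np.linalg.solve(S11, X1.T)) - X2.T @ M).max()
    return gap1, gap2

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 7))                  # hypothetical design
print(schur_gaps(X, [0, 3, 4]))
```

Both gaps should be at the level of floating point rounding, reflecting that $\Sigma_{21}\Sigma_{11}^{-1}X_1' = X_2'P_1$.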

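Taking the inner product of $\lambda Q_{41}'s_4$ and (5.10), we obtain, after some
algebra, that, by (5.4) and Table 1,

(5.12)
$$\begin{aligned}
v_{14}'(v_{13} + v_{14})
&= s_4'Q_{41}\Sigma_{11}^{-1}s_1\lambda^2/n \\
&= s_4'Q_{41}\Sigma_{11}^{-1}\Sigma_{12}\beta_2\lambda + s_4'Q_{41}\Sigma_{11}^{-1}X_1'\varepsilon\lambda/n + s_4'(\beta_4 - \hat\beta_4)\lambda.
\end{aligned}$$

Similarly, the inner product of $\beta_2$ and (5.11) yields

$$\begin{aligned}
\|w_2\|^2 = \beta_2'X_2'(I - P_1)X_2\beta_2
&\le -\beta_2'X_2'(I - P_1)\varepsilon - \beta_2'\Sigma_{21}\Sigma_{11}^{-1}\lambda s_1 + \|\beta_2\|_1\lambda \\
&= -w_2'\varepsilon - s_1'\Sigma_{11}^{-1}\Sigma_{12}\beta_2\lambda + \|\beta_2\|_1\lambda.
\end{aligned}$$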

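Since $s_4'\hat\beta_4 \ge 0$, by (5.1), and $\|\beta_2\|_1 + s_4'\beta_4 \le \|\beta_2\|_1 + \|\beta_4\|_1 = \|\beta_5\|_1 +
\|\beta_0\|_1 \le \|\beta_5\|_1 + \eta_1$, by (2.4) and Table 1, the sum of (5.12) and the above
inequality gives

(5.13)
$$\begin{aligned}
\|v_{14}\|^2 + \|w_2\|^2 + v_{14}'v_{13}
&\le (s_4'Q_{41}\Sigma_{11}^{-1}X_1'\lambda/n - w_2')\varepsilon - s_3'Q_{31}\Sigma_{11}^{-1}\Sigma_{12}\beta_2\lambda + (\|\beta_2\|_1 + s_4'\beta_4)\lambda \\
&\le \|X_1\Sigma_{11}^{-1}Q_{41}'s_4\lambda/n - w_2\| \cdot |u'\varepsilon| + \|v_{13}\| \cdot \|\Sigma_{11}^{-1/2}\Sigma_{12}\beta_2\|n^{1/2} + (\|\beta_5\|_1 + \eta_1)\lambda,
\end{aligned}$$

by the definition of $u$ in (5.9). Since $\|X_1\Sigma_{11}^{-1/2}v\|^2/n = \|v\|^2$ for all $v \in \mathbb{R}^{q_1}$
and $w_2$ is orthogonal to $X_1$, we find that $\|X_1\Sigma_{11}^{-1}Q_{41}'s_4\lambda/n - w_2\| =
(\|v_{14}\|^2 + \|w_2\|^2)^{1/2}$. Similarly, $\|\Sigma_{11}^{-1/2}\Sigma_{12}\beta_2\|n^{1/2} = \|P_1X_2\beta_2\|$. Thus, by
(5.13), $\|v_{14}\|^2 + \|w_2\|^2$ is bounded by

$$(\|v_{14}\|^2 + \|w_2\|^2)^{1/2}|u'\varepsilon| + (\|\beta_5\|_1 + \eta_1)\lambda + (\|v_{14}\| + \|P_1X_2\beta_2\|)\|v_{13}\|.$$

This implies (5.8), since, by (5.3), (5.4) and (5.6),

$$\|v_{13}\|^2 = (\lambda^2/n)s_3'Q_{31}\Sigma_{11}^{-1}Q_{31}'s_3 \le \lambda^2|A_3|/\{nc_*(q_1)\}.$$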
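Step 2. Let $B_1 = \{q\lambda^2/(nc^*(q_1))\}^{1/2}$ and $B_2 = \{q\lambda^2/(nc_*(q \vee q_1))\}^{1/2}$.
Consider, in this step, the event

(5.14) $|u'\varepsilon|^2 \le \dfrac{(q_1 \vee 1)\lambda^2}{4c^*(q_1)n}.$

We will later show that this event has high probability. We prove that, with
$q_1 = |A_1|$ and in the event (5.14),

(5.15)
$$(q_1 - q)^+ \le \Big\{1 + \frac{4c^*(q_1)\eta_1n}{\lambda q} + 4\Big(\frac{c^*(q_1)}{c_*(q_1)}\Big)^{1/2}\Big(\frac{c^*(q_1)\eta_2^2n}{\lambda^2q}\Big)^{1/2} + \frac{4c^*(q_1)}{c_*(q_1)}\Big\}q,$$

provided that the $A_1$ in (5.2) contains all labels $j$ for "large" $\beta_j$,

(5.16) $\{j : \hat\beta_j(\lambda) \neq 0 \text{ or } j \notin A_0\} \subset A_1 \subset \{j : |x_j'\{y - X\hat\beta(\lambda)\}| = \lambda \text{ or } j \notin A_0\}.$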
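Moreover, for general $A_1$ satisfying (5.2), we prove that in the event (5.14),

(5.17)
$$\|w_2\|^2 \le \frac83\Big\{\frac{B_1^2}{4} + \eta_1\lambda + \sqrt2(1 + \sqrt{C_5})\eta_2B_2 + \frac{B_2^2}{2} + \frac43C_5B_2^2\Big\}$$

with $C_5 = c^*(|A_5|)/c_*(|A_1 \cup A_5|)$, and, for $c_{*,5} = c_*(|A_1 \cup A_5|)$,

(5.18)
$$\|\beta_5\|^2nc_{*,5} \le \frac83\Big\{\frac{B_1^2}{4} + \eta_1\lambda + \eta_2\Big(\frac{\lambda^2q}{nc_*(q_1)}\Big)^{1/2} + \frac{\lambda^2q}{2nc_*(q_1)} - \frac34\eta_2^2\Big\} + \Big\{\frac43\Big(\frac{\lambda^2q}{nc_{*,5}}\Big)^{1/2}\sqrt{1 + \frac{c^*(|A_5|)}{c_*(q_1)}} + 2\eta_2\Big\}^2.$$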

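By (5.14) and (5.5), we have $|u'\varepsilon|^2 \le (\|v_{14}\|^2 + B_1^2)/4$ so that

$$(\|v_{14}\|^2 + \|w_2\|^2)^{1/2}|u'\varepsilon| \le \frac14(\|v_{14}\|^2 + \|w_2\|^2) + |u'\varepsilon|^2 \le \frac12\|v_{14}\|^2 + \frac{\|w_2\|^2 + B_1^2}{4}.$$

Inserting this inequality into (5.8), we find, by algebra, that

(5.19)
$$\|v_{14}\|^2 + \frac32\|w_2\|^2 \le \frac{B_1^2}{2} + 2(\|\beta_5\|_1 + \eta_1)\lambda + 2(\|v_{14}\| + \|P_1X_2\beta_2\|)\Big(\frac{\lambda^2|A_3|}{nc_*(q_1)}\Big)^{1/2}.$$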
We first prove (5.15) under (5.16). It follows from (5.16) and Table 1 that
$A_5 = \emptyset$, so $\|\beta_5\|_1 = 0$, $|A_3| = q \le q_1$ and $\|\Sigma_{11}^{-1/2}\Sigma_{12}\beta_2\|n^{1/2} = \|P_1X_2\beta_2\| =
\|P_1X_6\beta_6\| \le \eta_2$, by (2.10). Thus, (5.19) implies

$$\|v_{14}\|^2 + \frac32\|w_2\|^2 \le \frac{B_1^2}{2} + 2\eta_1\lambda + 2(\|v_{14}\| + \eta_2)B_2.$$

Since $x^2 \le c + 2bx$ implies $x^2 \le (b + \sqrt{b^2 + c})^2 \le 2c + 4b^2$ for $x = \|v_{14}\|$, it
follows that

$$\|v_{14}\|^2 \le B_1^2 + 4\eta_1\lambda + 4\eta_2B_2 + 4B_2^2.$$
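
The elementary inequality used in this step (and again for (5.17) and (5.18)) follows by completing the square. A short derivation, under the assumptions $x \ge 0$, $b \ge 0$ and $b^2 + c \ge 0$, which hold here with $x = \|v_{14}\|$, $b = B_2$ and $c = B_1^2/2 + 2\eta_1\lambda + 2\eta_2B_2$:

```latex
\begin{align*}
x^2 \le c + 2bx
  &\;\Longrightarrow\; (x - b)^2 \le b^2 + c \\
  &\;\Longrightarrow\; x \le b + \sqrt{b^2 + c} \\
  &\;\Longrightarrow\; x^2 \le \bigl(b + \sqrt{b^2 + c}\bigr)^2
      \le 2b^2 + 2(b^2 + c) = 4b^2 + 2c .
\end{align*}
```

The last step uses $(u + v)^2 \le 2u^2 + 2v^2$. Substituting the stated $b$ and $c$ gives $2c + 4b^2 = B_1^2 + 4\eta_1\lambda + 4\eta_2B_2 + 4B_2^2$.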

Since $\|v_{14}\|^2 \ge (q_1 - q)^+\lambda^2/\{nc^*(q_1)\}$, by (5.5), we find, by the definition of
$B_1$ and $B_2$, that

$$(q_1 - q)^+ \le q + \frac{c^*(q_1)n}{\lambda^2}\Big\{4\eta_1\lambda + 4\eta_2\Big(\frac{\lambda^2q}{c_*(q_1)n}\Big)^{1/2} + \frac{4\lambda^2q}{nc_*(q_1)}\Big\}.$$

This gives (5.15) by simple algebra.

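For general $A_1$ satisfying (5.2), $A_5$ is no longer empty. Still, since $|A_3| +
|A_5| \le q$ by Table 1 and $\|\Sigma_{11}^{-1/2}\Sigma_{12}\beta_2\|n^{1/2} = \|P_1X_2\beta_2\|$, we have, by (5.6),
that

$$\begin{aligned}
\Big(\frac{\lambda^2|A_3|}{nc_*(q_1)}\Big)^{1/2}\|\Sigma_{11}^{-1/2}\Sigma_{12}\beta_2\|n^{1/2} + \|\beta_5\|_1\lambda
&\le \Big(\frac{\lambda^2|A_3|}{nc_*(q_1)}\Big)^{1/2}\|P_1X_2\beta_2\| + \Big(\frac{\lambda^2|A_5|}{nc_*(q)}\Big)^{1/2}\|X_5\beta_5\| \\
&\le \Big(\frac{2\lambda^2q}{nc_*(q_1 \vee q)}\Big)^{1/2}\max(\|P_1X_2\beta_2\|, \|X_5\beta_5\|).
\end{aligned}$$

Moreover, it follows from Table 1, (4.1), (5.4), (5.7) and (2.10) that

$$\begin{aligned}
\max(\|P_1X_2\beta_2\|, \|X_5\beta_5\|)
&\le \{nc^*(|A_5|)\}^{1/2}\|\beta_5\| + \|X_6\beta_6\| \\
&\le \sqrt{C_5}\|w_5\| + \|X_6\beta_6\| \le \sqrt{C_5}\|w_2\| + (1 + \sqrt{C_5})\eta_2,
\end{aligned}$$

with $C_5 = c^*(|A_5|)/c_*(|A_1 \cup A_5|)$. Applying these inequalities to the
right-hand side of (5.19), we find that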
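$$\begin{aligned}
\|v_{14}\|^2 + \frac32\|w_2\|^2
&\le \frac{B_1^2}{2} + 2\eta_1\lambda + 2\|v_{14}\|\Big(\frac{\lambda^2|A_3|}{nc_*(q_1)}\Big)^{1/2}
 + 2\big(\sqrt{C_5}\|w_2\| + (1 + \sqrt{C_5})\eta_2\big)\Big(\frac{2\lambda^2q}{nc_*(q_1 \vee q)}\Big)^{1/2} \\
&\le \frac{B_1^2}{2} + 2\eta_1\lambda + 2\sqrt2(1 + \sqrt{C_5})\eta_2B_2 + 2B_2(\|v_{14}\| + \sqrt{2C_5}\|w_2\|),
\end{aligned}$$

since $|A_3| \le q$ and $B_2^2 = \lambda^2q/\{nc_*(q_1 \vee q)\}$. With $2\|v_{14}\|B_2 \le \|v_{14}\|^2 + B_2^2$,
the above inequality gives

$$\|w_2\|^2 \le \frac23\Big(\frac{B_1^2}{2} + 2\eta_1\lambda + 2\sqrt2(1 + \sqrt{C_5})\eta_2B_2 + B_2^2\Big) + \frac43\sqrt{2C_5}B_2\|w_2\|.$$

Since $x^2 \le c + bx$ implies that $x^2 \le 2c + b^2$ for $x = \|w_2\|$, this gives (5.17).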

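The proof of (5.18) differs slightly from that of (5.17). It suffices to
consider the case of $\|\beta_5\|(nc_{*,5})^{1/2} \ge \eta_2$. By Table 1, (5.4), the definition of $\eta_2$
with (2.10) and (5.7), $\|w_2\| + \eta_2 \ge \|w_5\| \ge \|\beta_5\|(nc_{*,5})^{1/2}$ with $c_{*,5} = c_*(|A_1 \cup
A_5|)$, so $\|w_2\|^2 \ge (\|\beta_5\|(nc_{*,5})^{1/2} - \eta_2)^2$. By (2.10) and (4.1), $\|P_1X_2\beta_2\| \le \eta_2 +
\|X_5\beta_5\| \le \eta_2 + \{nc^*(|A_5|)\}^{1/2}\|\beta_5\|$. Thus, since $2\|v_{14}\|\{\lambda^2|A_3|/(nc_*(q_1))\}^{1/2} \le
\|v_{14}\|^2 + \lambda^2q/\{nc_*(q_1)\}$, (5.19) implies that

$$\begin{aligned}
\frac32\big(\|\beta_5\|(nc_{*,5})^{1/2} - \eta_2\big)^2
&\le \frac{B_1^2}{2} + 2(\|\beta_5\|_1 + \eta_1)\lambda + \frac{\lambda^2q}{nc_*(q_1)} \\
&\quad + 2\big(\eta_2 + \|\beta_5\|\{nc^*(|A_5|)\}^{1/2}\big)\Big(\frac{\lambda^2|A_3|}{nc_*(q_1)}\Big)^{1/2}.
\end{aligned}$$

Since $\|\beta_5\|_1^2 \le |A_5| \cdot \|\beta_5\|^2$ and $|A_3| + |A_5| = q$, by Cauchy-Schwarz,

$$\begin{aligned}
\|\beta_5\|_1\lambda + \|\beta_5\|\{nc^*(|A_5|)\}^{1/2}\Big(\frac{\lambda^2|A_3|}{nc_*(q_1)}\Big)^{1/2}
&\le \|\beta_5\|\lambda\Big(\sqrt{|A_5|} + \sqrt{c^*(|A_5|)|A_3|/c_*(q_1)}\Big) \\
&\le \|\beta_5\|\lambda\sqrt q\,\big(1 + c^*(|A_5|)/c_*(q_1)\big)^{1/2}.
\end{aligned}$$

It follows from the above two inequalities that

$$\begin{aligned}
\|\beta_5\|^2nc_{*,5}
&\le \frac23\Big\{\frac{B_1^2}{2} + 2\eta_1\lambda + \frac{\lambda^2q}{nc_*(q_1)} + 2\eta_2\Big(\frac{\lambda^2q}{nc_*(q_1)}\Big)^{1/2} \\
&\qquad\quad + 2\|\beta_5\|\lambda\sqrt q\,\big(1 + c^*(|A_5|)/c_*(q_1)\big)^{1/2} + 3\eta_2\|\beta_5\|(nc_{*,5})^{1/2} - \frac32\eta_2^2\Big\} \\
&\le \frac43\Big(\frac{B_1^2}{4} + \eta_1\lambda + \eta_2\Big(\frac{\lambda^2q}{nc_*(q_1)}\Big)^{1/2} + \frac{\lambda^2q}{2nc_*(q_1)} - \frac34\eta_2^2\Big) \\
&\quad + \|\beta_5\|(nc_{*,5})^{1/2}\Big\{\frac43\Big(\frac{\lambda^2q}{nc_{*,5}}\Big)^{1/2}\big(1 + c^*(|A_5|)/c_*(q_1)\big)^{1/2} + 2\eta_2\Big\}.
\end{aligned}$$

Again, since $x^2 \le c + 2bx$ implies that $x^2 \le 4b^2 + 2c$ for $b^2 + c \ge 0$, (5.18)
follows.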

Step 3. In this step, we find probabilistic bounds. We shall take more
generous bounds $c_*(m) = c_*$ and $c^*(m) = c^*$ in (4.1) for $m \le q^*$ with the
given constants $c_*$ and $c^*$ in (2.13) and consider the event

(5.20) $q_1 \le |A_1 \cup A_5| \le q^*, \qquad |u'\varepsilon|^2 \le \dfrac{(q_1 \vee 1)\lambda^2}{4c^*n}.$

In this event, we have $C_5 = C = c^*/c_*$ by (2.15) and $c_{*,5} = c_*$. Moreover,
by (2.14) and the definition of $B_1$ and $B_2$ in Step 2, we have $r_1^2 = \eta_1\lambda/B_1^2$,
$r_2^2 = \eta_2^2/B_1^2$ and $B_2^2 = CB_1^2$. Thus, by (2.15), (2.16) and (2.17), in the event
(5.20), the assertions (5.15), (5.17) and (5.18) of Step 2 become

(5.21) $(q_1 - q)^+ + q \le (1 + 4r_1^2 + 4\sqrt Cr_2 + 4C)q + q = M_1^*(\lambda)q,$



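(5.22) $\|w_2\|^2 \le M_2^*(\lambda)\dfrac{q\lambda^2}{c^*n}$

and

(5.23)
$$\begin{aligned}
nc_*\|\beta_5\|^2
&\le \frac83\Big(\frac14 + r_1^2 + r_2\sqrt C + \frac C2 - \frac{3r_2^2}{4}\Big)B_1^2 + \Big(\frac43\sqrt{C(1 + C)} + 2r_2\Big)^2B_1^2 \\
&= \frac83\Big\{\frac14 + r_1^2 + r_2\sqrt C\big(1 + 2\sqrt{1 + C}\big) + \frac{3r_2^2}{4} + C\Big(\frac76 + \frac23C\Big)\Big\}B_1^2 \\
&= M_3^*(\lambda)\frac{q\lambda^2}{c^*n}.
\end{aligned}$$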

We note that since the constants $r_1$, $r_2$ and $C$ depend only on $(\lambda, q, \eta_1, \eta_2, c_*, c^*)$
and (5.16) simply requires larger $A_1$, (5.21) holds for all $A_1$ satisfying (5.2).
This is not the case in Step 2 since $c_*(q_1)$ and $c^*(q_1)$ are used without (5.20).
In view of (5.2), (5.4) and Table 1, (5.21), (5.22) and (5.23) match the
assertions of the theorem. Thus, it remains to show that (5.20) holds for all $\lambda$
satisfying (2.19) with the probability in (2.20).

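It follows from (5.9) and (5.4) that $|u'\varepsilon|$ is no greater than

(5.24)
$$\chi_m^* \equiv \max_{|A| = m}\ \max_{s \in \{\pm1\}^m}\left|\varepsilon'\,\frac{X_A(X_A'X_A)^{-1}s\lambda - (I - P_A)X\beta}{\|X_A(X_A'X_A)^{-1}s\lambda - (I - P_A)X\beta\|}\right|$$

for $q_1 = m \ge 0$. Define $\Omega_{m_0}$ as Borel sets in $\mathbb{R}^{n \times (p+1)}$,

$$\Omega_{m_0} \equiv \big\{(X, \varepsilon) : \chi_m^* \le \sqrt{2(1 + c_0)(m \vee 1)\log(p \vee a_n)}\ \ \forall m \ge m_0\big\}.$$

Since $2(1 + c_0)(m \vee 1)\log(p \vee a_n) \le (m \vee 1)\lambda^2/(4c^*n)$ by (2.19),

(5.25) $(X, \varepsilon) \in \Omega_{m_0} \ \Rightarrow\ |u'\varepsilon|^2 \le \dfrac{(q_1 \vee 1)\lambda^2}{4c^*n}, \qquad q_1 \ge m_0 \ge 0.$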

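By (5.1), (5.16) and the continuity of $\hat\beta(\lambda)$ in $\lambda$, we are able to choose $A_1$ so
that it changes one-at-a-time, beginning from the initial $\lambda = \infty$ with $\hat\beta = 0$,
to the lower bound in (2.19). Thus, since $M_1^*(\lambda)q + 1 \le q^*$, by (2.19) and
(2.18) for such $\lambda$, and since the path of $q_1$ cannot cross the gap between
$M_1^*(\lambda)q$ and $M_1^*(\lambda)q + 1$ due to the continuity of $M_1^*(\lambda)$ in $\lambda$, (5.21) and
(5.25) imply that for all $\lambda$ satisfying (2.19),

(5.26) $(X, \varepsilon) \in \Omega_q \ \Rightarrow\ q_1 \equiv \#\{j : |x_j'(y - X\hat\beta)| = \lambda \text{ or } j \notin A_0\} \le M_1^*(\lambda)q.$

By (5.24), $\chi_m^*$ is the maximum of $\binom{p}{m}2^{m \vee 1}$ standard normal variables, so

(5.27)
$$\begin{aligned}
1 - P\{(X, \varepsilon) \in \Omega_0\}
&\le \sum_{m=0}^\infty 2^{m \vee 1}\binom{p}{m}\exp\big(-(m \vee 1)(1 + c_0)\log(p \vee a_n)\big) \\
&\le \frac{2}{(p \vee a_n)^{1 + c_0}} + \exp\Big(\frac{2p}{(p \vee a_n)^{1 + c_0}}\Big) - 1.
\end{aligned}$$

The proof is complete, since (5.20) follows from (5.25), (5.26) and (5.27).
□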
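
The last inequality in (5.27) can be checked by direct summation. Writing $t = (p \vee a_n)^{-(1 + c_0)}$, the series equals $\sum_m 2^{m \vee 1}\binom{p}{m}t^{m \vee 1}$, which the sketch below compares with the closed-form bound $2t + e^{2pt} - 1$. The function names and the values of $p$, $a_n$ and $c_0$ are illustrative assumptions.

```python
from math import comb, exp

def tail_sum(p, t):
    """Sum over m = 0..p of 2^(m v 1) * C(p, m) * t^(m v 1); terms with m > p vanish."""
    return sum(2 ** max(m, 1) * comb(p, m) * t ** max(m, 1) for m in range(p + 1))

def closed_form_bound(p, t):
    """Right-hand side of the last inequality: 2t + exp(2pt) - 1."""
    return 2 * t + exp(2 * p * t) - 1

p, a_n, c0 = 50, 0, 0.5            # hypothetical values
t = max(p, a_n) ** -(1 + c0)       # t = (p v a_n)^-(1 + c0)
print("series:", tail_sum(p, t))
print("bound: ", closed_form_bound(p, t))
```

The series equals $2t + (1 + 2t)^p - 1$ exactly, so the bound amounts to $(1 + 2t)^p \le e^{2pt}$.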



6. Related results and final remarks. In this section, we discuss some 
related results and make a few final remarks. 

Meinshausen and Buhlmann (2006) and Zhao and Yu (2006) proved the
sign-consistency $P\{\mathrm{sgn}(\hat\beta_j) = \mathrm{sgn}(\beta_j)\ \forall j\} \to 1$, with the convention $\mathrm{sgn}(0) =
0$, for the LASSO under (2.5) and the strong irrepresentable condition

(6.1) $\|\Sigma_{21}\Sigma_{11}^{-1}s_1\|_\infty \le 1 - \kappa$, for some $\kappa > 0$,

where $\Sigma_{jk} = X_{A_j}'X_{A_k}/n$ and $s_1 = \mathrm{sgn}(\beta_1)$, with $\beta_1 = (\beta_j, j \in A_1)'$, $A_1 =
\{j : \beta_j \neq 0\}$ and $A_2 = A_1^c$. We note that the definition of $A_1$ here is different
from (5.2) or (5.16). Between the two papers, Zhao and Yu (2006) imposed
weaker conditions on $\{n, p, \lambda\}$ as
(6.2) A > n'^V^'logP, min/3? > n^^^, n>rf-^q\o%'p, 
for large $n$ and some constants $a_j > 0$, where $q = \#\{j : \beta_j \neq 0\}$.

Although (6.2) is not sharp, a careful study of the arguments in these two
papers reveals that under (6.1), condition (6.2) can be weakened to

(6.3)

(for each component), for the sign-consistency, via (5.10) and (5.11),
provided $\varepsilon \sim N(0, \sigma^2I)$, $\|x_j\|^2 = n\ \forall j$, $2\log(p - q) \le a_{2n} \to \infty$ and $2\log q \le
a_{1n} \to \infty$. This approach was taken in Wainwright (2006) under a stronger
version of (6.3). Furthermore, for random designs $X$ with i.i.d. Gaussian
rows, Wainwright (2006) proved that the empirical version of his conditions
on $X$ follows from a population version of them.

Compared with these results on the sign-consistency, our focus is the
properties of the model $\hat A$ selected by the LASSO under milder conditions.
We impose the sparse Riesz condition (2.13), instead of (6.1), to prove the
rate-consistency (2.11) in Theorem 1 in terms of the sparsity, bias and the
norm of missing large coefficients. We replace the $n^{a_j}$, $j = 1, 2, 3$, in (6.2)
by specific constants in, respectively, (2.19), Theorem 2 and Proposition 2.
The second and third inequalities in (6.2) are not imposed as conditions in
Theorem 1. Moreover, we allow many small nonzero coefficients, as long as
the sum of their absolute values is of the order $O(q\lambda/n)$. Desirable
properties of the LASSO estimator follow as in Section 3 once we establish the
appropriate upper bound for the dimension $|\hat A|$ of the LASSO selection.

Zhao and Yu (2006) and Zou (2006) (for fixed $p$) showed that the
irrepresentable condition is necessary for the zero-consistency: $\hat\beta_j \neq 0 \Leftrightarrow \beta_j \neq 0$
with high probability. It follows from the Karush-Kuhn-Tucker condition
(5.1) that when $\varepsilon = 0$, the weaker version of (6.1) with $\kappa = 0$ is necessary and
sufficient for (2.2) to be zero-consistent. However, the irrepresentable
condition is somewhat restrictive. As mentioned in Zhao and Yu (2006), (6.1)
holds for all possible signs of $\beta$ if and only if the norm of $\Sigma_{21}\Sigma_{11}^{-1}$ is less
than 1 as a linear mapping from $(\mathbb{R}^q, \|\cdot\|_\infty)$ to $(\mathbb{R}^{p-q}, \|\cdot\|_\infty)$. Without
knowing the set $A_1$ of nonzero $\beta_j$, it is not clear how to verify (6.1), other
than using simple bounds on the correlation $x_j'x_k/n$ for $j \neq k$, as in Zhao and
Yu (2006). Since $\|\Sigma_{11}^{-1}s_1\|^2$ is typically of the order $\|s_1\|^2 = q$, (6.1) is not
a consequence of the $\ell_2$-based sparse Riesz condition (2.13) in general. For
certain large data sets, it is reasonable to expect large $\|s_1\|^2 = q$, even under
the assumption $q \ll \min(n, p)$. In this case, (6.1) is quite restrictive.
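
The operator-norm reading of (6.1) can be made concrete. For a matrix $M = \Sigma_{21}\Sigma_{11}^{-1}$, the norm from $(\mathbb{R}^q, \|\cdot\|_\infty)$ to $(\mathbb{R}^{p-q}, \|\cdot\|_\infty)$ is the largest absolute row sum, and it equals the worst case of $\|Ms_1\|_\infty$ over all sign vectors $s_1$. The sketch below confirms this by brute force on a small random design; the function names, the index set and the simulated data are illustrative assumptions.

```python
import itertools
import numpy as np

def _blocks(X, A1):
    """Sigma_11 and Sigma_21 for the index set A1 and its complement."""
    n, p = X.shape
    A2 = [j for j in range(p) if j not in A1]
    return X[:, A1].T @ X[:, A1] / n, X[:, A2].T @ X[:, A1] / n

def irrepresentable_stat(X, A1, s1):
    """||Sigma_21 Sigma_11^{-1} s_1||_inf for a given sign vector s1."""
    S11, S21 = _blocks(X, A1)
    return np.abs(S21 @ np.linalg.solve(S11, np.asarray(s1, dtype=float))).max()

def worst_case_over_signs(X, A1):
    """Maximum of the statistic over all 2^q sign vectors (feasible only for small q)."""
    signs = itertools.product([-1.0, 1.0], repeat=len(A1))
    return max(irrepresentable_stat(X, A1, s) for s in signs)

def operator_norm_inf(X, A1):
    """Largest absolute row sum of Sigma_21 Sigma_11^{-1}."""
    S11, S21 = _blocks(X, A1)
    return np.abs(S21 @ np.linalg.inv(S11)).sum(axis=1).max()

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 7))          # hypothetical design
A1 = [0, 2, 5]                            # hypothetical support
print(worst_case_over_signs(X, A1), operator_norm_inf(X, A1))
```

The brute-force search is exponential in $q$, which is one way to see why the row-sum characterization is the practical form of the condition.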

Bunea, Tsybakov and Wegkamp (2006) and van de Geer (2007) studied
convergence rates of $\|X\hat\beta - X\beta\|^2$ and $\|\hat\beta - \beta\|_1$ under the sparsity condition
(2.5) and for random designs of the form $x_{ij} = \psi_j(x^i)$, where $x^i$ are i.i.d.
variables and $\psi_j$ are suitable basis functions, that is, with the rows of $X$ being
i.i.d. copies of $(\xi_1, \dots, \xi_p)$ as in Section 4.2. Bunea, Tsybakov and Wegkamp
(2006) obtained (3.5) and (3.6) for $\alpha = 1$ under two sets of conditions. The
first set includes the lower bound $\rho_* > 0$ in (4.5), uniform upper bounds for
$\|\xi_j\|_\infty$ and $q \le c_0\rho_*\sqrt{n/\log p}$ as in Proposition 2(ii). The second set relaxes
the restriction on $q$ to $q \le c_0\sqrt{n/\log p}$, but relies on the correlation bound
$|\mathrm{corr}(\psi_j, \psi_k)| \le 1/(45q)$ for $\beta_k \neq 0 = \beta_j$, which has the flavor of the strong
irrepresentable condition (6.1). In fact, the sample version of this condition
implies $|\Sigma_{21}\Sigma_{11}^{-1}s_1| \le 1/\{45\lambda_{\min}(\Sigma_{11})\}$. van de Geer (2007) considered more
general forms of loss function and risk bounds under $\max_{j \le p}\|\xi_j\|_\infty \le K_n$. An
interesting aspect of her result is the use of $D(\beta^*)$ in place of $q$ in her version
of (3.5) and (3.6), where $\beta^*$ is the solution of (2.2) at $y = X\beta$ and $D(\beta)$ is an
upper bound of $(\sum_{\beta_j \neq 0}|b_j|)^2/E|\sum_j b_j\xi_j|^2$. Since $D(\beta) = \#\{j : \beta_j \neq 0\}/\rho_*$
works under the Riesz condition and van de Geer (2007) does not assume
(4.5) or (6.1), her upper bounds are indeed of a more general form than
(3.5) and (3.6) when the rows of $X$ are i.i.d., although the relationship of
her risk bounds to $\{n, p, q\}$ is not explicit. Bounds on $\|X\hat\beta - X\beta\|^2$ and
$\|\hat\beta - \beta\|_1$ do not directly imply the rate-consistency (2.11), but the converse
is true for the LASSO as in Theorem 3, even for all the $\|\cdot\|_\alpha$ losses with
$\alpha \ge 1$. Greenshtein and Ritov (2004) proved the persistency of a LASSO-like
estimator in prediction risk under a condition on the order of $\|\beta\|_1$ as
$n \to \infty$. Since a different performance measurement is concerned, their result
does not require (4.5) or (6.1).

For the estimation of $\beta$, Donoho (2006) proved the $\ell_2$-consistency of the
LASSO estimator for $p \asymp n$ when $X$ is a certain normalization of a random
matrix with i.i.d. $N(0, 1)$ entries. Candes and Tao (2007) proved that the
LASSO-like Dantzig estimator $\hat\beta$ has the oracle property

under the sparsity condition (2.5) and a "uniform uncertainty principle".
Since (3.6) with $\alpha = 2$ is comparable to their result, we have provided an
affirmative answer to the question posed in Efron, Hastie and Tibshirani
(2007), page 2363. SRC (2.13) may still hold. Recent results on random
matrices are used by Candes and Tao to bound $\delta(m)$. For example, they allow
qmaxjkUjf, >c l/(logp)^ when X/-y/re is a random sample of n rows from a 
px p orthonormal matrix {ujk). Their results certainly have implications on 
the validity of (2.13) and (4.1) for random design matrices. 

Meinshausen and Yu (2006) proved that under (2.6) and certain other
regularity conditions,

(6.4) $\|\hat\beta - \beta\|^2 \le O_P\Big(\dfrac{\log p}{n}\,\dfrac{m_\lambda}{c_*^2(m_\lambda)}\Big) + O\Big(\dfrac{q}{m_\lambda}\Big) = o_P(1),$

where $m_\lambda = c^*(n \wedge p)E\|y\|^2n/\lambda^2$. They also obtained a version of (6.4), with
q/mx replaced by B? /m\ when c^:{mx) is bounded away from zero and 
$\beta$ belongs to a certain weak $\ell_\alpha$-ball of radius $R$ with $0 < \alpha < 1$. In spirit,
our paper and theirs both study the LASSO under conditions on the sparse
eigenvalues $c_*(m)$ and $c^*(m)$, instead of (6.1), and both allow $p \gg n$ and
many small nonzero coefficients. While our focus is on the properties of the
selected model $\hat A$ in (2.3), specifically its sparsity $|\hat A|$, bias (2.8) and the
norm of the missing large coefficients (2.9), theirs is on the $\ell_2$-loss $\|\hat\beta - \beta\|$.
Inspired by their results, and as suggested by the reviewers, we added Section
3 in the revision to discuss the implications of our results on the LASSO
estimation. Still, the results in the two papers are complementary to each
other. While our results are based on the upper bound (2.21) for the sparsity,
Meinshausen and Yu (2006) used $|\hat A| \le c^*(|\hat A|)\|y\|^2n/\lambda^2$. This is a crucial
technical difference between the two papers.

Our main result asserts that as far as the rate consistency (2.11) in model
selection is concerned, the performance of the LASSO for correlated designs
under the sparse Riesz condition is comparable to its performance in the
much simpler orthonormal designs, as in Example 1. Although the LASSO
selects all coefficients of order larger than $\sqrt q\lambda/n$, by Theorem 2, and is
sign-consistent under (6.1) and (6.3), it could miss coefficients of orders between
$\sqrt q\lambda/n$ and the threshold level $\lambda/n$. This discrepancy with a factor of $\sqrt q$
is due to the interference of the estimation bias of the LASSO estimator
$\hat\beta(\lambda)$ with model selection and cannot be removed for large $q$. For example,
the loss measured in (2.23) cannot be recovered after the LASSO selection.
A possible remedy for this discrepancy is adaptive LASSO, but for $p \gg
n$ the choice of the initial estimator is unclear [Zou (2006)]. Huang, Ma
and Zhang (2007) proved the sign consistency of adaptive LASSO under
a certain partial orthogonality condition on the pairwise correlations among
vectors $\{y, x_1, \dots, x_p\}$. Threshold and other selection methods can be used
to remove small coefficients in $\hat A$ after LASSO selection based on the
selected data $(y, X_{\hat A})$ [cf. (3.6) for $\alpha = \infty$, Meinshausen and Yu (2006) and
the references therein].